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Preface 


This is an exciting time to be working in speech and language processing. 
Historically distinct fields (natural language processing, speech recognition, 
computational linguistics, computational psycholinguistics) have begun to 
merge. The commercial availability of speech recognition, and the need 
for web-based language techniques have provided an important impetus for 
development of real systems. The availability of very large on-line corpora 
has enabled statistical models of language at every level, from phonetics to 
discourse. We have tried to draw on this emerging state of the art in the 
design of this pedagogical and reference work: 

1. Coverage 

In attempting to describe a unified vision of speech and language pro¬ 
cessing, we cover areas that traditionally are taught in different courses 
in different departments: speech recognition in electrical engineering, 
parsing, semantic interpretation, and pragmatics in natural language 
processing courses in computer science departments, computational 
morphology and phonology in computational linguistics courses in lin¬ 
guistics departments. The book introduces the fundamental algorithms 
of each of these fields, whether originally proposed for spoken or writ¬ 
ten language, whether logical or statistical in origin, and attempts to 
tie together the descriptions of algorithms from different domains. We 
have also included coverage of applications like spelling checking and 
information retrieval and extraction, as well as to areas like cognitive 
modeling. A potential problem with this broad-coverage approach is 
that it required us to include introductory material for each field; thus 
linguists may want to skip our description of articulatory phonetics, 
computer scientists may want to skip such sections as regular expres¬ 
sions, and electrical engineers the sections on signal processing. Of 
course, even in a book this long, we didn’t have room for everything. 
Thus this book should not be considered a substitute for important rel¬ 
evant courses in linguistics, automata and formal language theory, or, 
especially, statistics and information theory. 

2. Emphasis on practical applications 

It is important to show how language-related algorithms and tech¬ 
niques (from HMMs to unification, from the lambda calculus to 
transformation-based learning) can be applied to important real-world 
problems: spelling checking, text document search, speech recogni- 


xxi 



XXII 


Preface 


tion, Web-page processing, part-of-speech tagging, machine transla¬ 
tion, and spoken-language dialog agents. We have attempted to do this 
by integrating the description of language processing applications into 
each chapter. The advantage of this approach is that as the relevant 
linguistic knowledge is introduced, the student has the background to 
understand and model a particular domain. 

3. Emphasis on scientific evaluation 

The recent prevalence of statistical algorithms in language processing, 
and the growth of organized evaluations of speech and language pro¬ 
cessing systems has led to a new emphasis on evaluation. We have, 
therefore, tried to accompany most of our problem domains with a 
Methodology Box describing how systems are evaluated (e.g. in¬ 
cluding such concepts as training and test sets, cross-validation, and 
information-theoretic evaluation metrics like perplexity). 

4. Description of widely available language processing resources 
Modem speech and language processing is heavily based on com¬ 
mon resources: raw speech and text corpora, annotated corpora and 
treebanks, standard tagsets for labeling pronunciation, part of speech, 
parses, word-sense, and dialog-level phenomena. We have tried to in¬ 
troduce many of these important resources throughout the book (for ex¬ 
ample the Brown, Switchboard, callhome, ATIS, TREC, MUC, and 
BNC corpora), and provide complete listings of many useful tagsets 
and coding schemes (such as the Penn Treebank, CLAWS C5 and C7, 
and the ARPAbet) but some inevitably got left out. Furthermore, rather 
than include references to URLs for many resources directly in the 
textbook, we have placed them on the book’s web site, where they can 
more readily updated. 

The book is primarily intended for use in a graduate or advanced under¬ 
graduate course or sequence. Because of its comprehensive coverage and the 
large number of algorithms, the book it also useful as a reference for students 
and professionals in any of the areas of speech and language processing. 

Overview of the book 

The book is divided into 4 parts in addition to an introduction and end matter. 
Part I, “Words”, introduces concepts related to the processing of words: pho¬ 
netics, phonology, morphology, and algorithms used to process them: finite 
automata, finite transducers, weighted transducers, N-grams, and Hidden 
Markov Models. Part II, “Syntax”, introduces parts-of-speech and phrase 
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structure grammars for English, and gives essential algorithms for process¬ 
ing word classes and structured relationships among words: part-of-speech 
taggers based on HMMs and transformation-based learning, the CYK and 
Earley algorithms for parsing, unification and typed feature structures, lex- 
icalized and probabilistic parsing, and analytical tools like the Chomsky 
hierarchy and the pumping lemma. Part III, “Semantics”, introduces first 
order predicate calculus and other ways of representing meaning, several 
approaches to compositional semantic analysis, along with applications to 
information retrieval, information extraction, speech understanding, and ma¬ 
chine translation. Part IV, “Pragmatics”, covers reference resolution and dis¬ 
course structure and coherence, spoken dialog phenomena like dialog and 
speech act modeling, dialog structure and coherence, and dialog managers, 
as well as a comprehensive treatment of natural language generation and of 
machine translation. 

Using this book 

The book provides enough material to be used for a full year sequence in 
speech and language processing. It is also designed so that it can be used for 
a number of different useful one-term courses: 


NLP 

1 quarter 

NLP 

1 semester 

Speech + NLP 

1 semester 

Comp. Linguistics 

1 quarter 

1. Intro 

2. Regex, FSA 

8. POS tagging 

9. CFGs 

10. Parsing 

11. Unification 

14. Semantics 

15. Sem. Analysis 
18. Discourse 

20. Generation 

1. Intro 

2. Regex, FSA 

3. Morph., FST 

6. N-grams 

8. POS tagging 

9. CFGs 

10. Parsing 

11. Unification 

12. Prob. Parsing 

14. Semantics 

15. Sem. Analysis 

16. Lex. Semantics 

18. Discourse 

19. WSDandIR 

20. Generation 

21. Machine Transl. 

1. Intro 

2. Regex, FSA 

3. Morph., FST 

4. Comp. Phonol. 

5. Prob. Pronun. 

6. N-grams 

7. HMMs & ASR 

8. POS tagging 

9. CFG 

10. Parsing 

12. Prob Parsing 

14. Semantics 

15. Sem. Analysis 
19. Dialog 

21. Machine Transl. 

1. Intro 

2. Regex, FSA 

3. Morph., FST 

4. Comp. Phonol. 

10. Parsing 

11. Unification 

13. Complexity 

16. Lex. Semantics 

18. Discourse 

19. Dialog 


Selected chapters from the book could also be used to augment courses 
in Artificial Intelligence, Cognitive Science, or Information Retrieval. 
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Dave Bowman: Open the pod bay doors , HAL. 

HAL: Vm sorry Dave , Vm afraid I can *t do that. 

Stanley Kubrick and Arthur C. Clarke, 
screenplay of 2001: A Space Odyssey 


The HAL 9000 computer in Stanley Kubrick’s film 2001: A Space 
Odyssey is one of the most recognizable characters in twentieth-century 
cinema. HAL is an artificial agent capable of such advanced language¬ 
processing behavior as speaking and understanding English, and at a crucial 
moment in the plot, even reading lips. It is now clear that HAL’s creator 
Arthur C. Clarke was a little optimistic in predicting when an artificial agent 
such as HAL would be available. But just how far off was he? What would 
it take to create at least the language-related parts of HAL? Minimally, such 
an agent would have to be capable of interacting with humans via language, 
which includes understanding humans via speech recognition and natural 
language understanding (and of course lip-reading), and of communicat¬ 
ing with humans via natural language generation and speech synthesis. 
HAL would also need to be able to do information retrieval (finding out 
where needed textual resources reside), information extraction (extracting 
pertinent facts from those textual resources), and inference (drawing con¬ 
clusions based on known facts). 

Although these problems are far from completely solved, much of the 
language-related technology that HAL needs is currently being developed, 
with some of it already available commercially. Solving these problems, 
and others like them, is the main concern of the fields known as Natural 
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Language Processing, Computational Linguistics and Speech Recognition 
and Synthesis, which together we call Speech and Language Processing. 
The goal of this book is to describe the state of the art of this technology 
at the start of the twenty-first century. The applications we will consider 
are all of those needed for agents like HAL, as well as other valuable areas 
of language processing such as spelling correction, grammar checking, 
information retrieval, and machine translation. 


1.1 Knowledge in Speech and Language Processing 

By speech and language processing, we have in mind those computational 
techniques that process spoken and written human language, as language. 
As we will see, this is an inclusive definition that encompasses everything 
from mundane applications such as word counting and automatic hyphen¬ 
ation, to cutting edge applications such as automated question answering on 
the Web, and real-time spoken language translation. 

What distinguishes these language processing applications from other 
data processing systems is their use of knowledge of language. Consider the 
Unix wc program, which is used to count the total number of bytes, words, 
and lines in a text file. When used to count bytes and lines, wc is an ordinary 
data processing application. However, when it is used to count the words 
in a file it requires knowledge about what it means to be a word , and thus 
becomes a language processing system. 

Of course, wc is an extremely simple system with an extremely lim¬ 
ited and impoverished knowledge of language. More sophisticated language 
agents such as HAL require much broader and deeper knowledge of lan¬ 
guage. To get a feeling for the scope and kind of knowledge required in 
more sophisticated applications, consider some of what HAL would need to 
know to engage in the dialogue that begins this chapter. 

To determine what Dave is saying, HAL must be capable of analyzing 
an incoming audio signal and recovering the exact sequence of words Dave 
used to produce that signal. Similarly, in generating its response, HAL must 
be able to take a sequence of words and generate an audio signal that Dave 
can recognize. Both of these tasks require knowledge about phonetics and 
phonology, which can help model how words are pronounced in colloquial 
speech (Chapter 4 and Chapter 5). 

Note also that unlike Star Trek’s Commander Data, HAL is capable of 
producing contractions like Vm and can’t. Producing and recognizing these 
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and other variations of individual words (for example recognizing that doors 
is plural) requires knowledge about morphology, which captures informa¬ 
tion about the shape and behavior of words in context (Chapter 2, Chapter 3). 

Moving beyond individual words, HAL must know how to analyze the 
structure underlying Dave’s request. Such an analysis is necessary among 
other reasons for HAL to determine that Dave’s utterance is a request for 
action, as opposed to a simple statement about the world or a question about 
the door, as in the following variations of his original statement. 

HAL, the pod bay door is open. 

HAL, is the pod bay door open? 

In addition, HAL must use similar structural knowledge to properly string 
together the words that constitute its response. For example, HAL must 
know that the following sequence of words will not make sense to Dave, 
despite the fact that it contains precisely the same set of words as the original. 

I’m I do, sorry that afraid Dave I’m can’t. 

The knowledge needed to order and group words together comes under the 
heading of syntax. 

Of course, simply knowing the words and the syntactic structure of 
what Dave said does not tell HAL much about the nature of his request. 
To know that Dave’s command is actually about opening the pod bay door, 
rather than an inquiry about the day’s lunch menu, requires knowledge of 
the meanings of the component words, the domain of lexical semantics, 
and knowledge of how these components combine to form larger meanings, 
compositional semantics. 

Next, despite its bad behavior, HAL knows enough to be polite to 
Dave. It could, for example, have simply replied No or No, I won't open 
the door. Instead, it first embellishes its response with the phrases I'm sorry 
and I'm afraid , and then only indirectly signals its refusal by saying I can % 
rather than the more direct (and truthful) I won't. 1 The appropriate use of this 
kind of polite and indirect language comes under the heading of pragmatics. 

Finally, rather than simply ignoring Dave’s command and leaving the 
door closed, HAL chooses to engage in a structured conversation relevant 
to Dave’s initial request. HAL’s correct use of the word that in its answer 
to Dave’s request is a simple illustration of the kind of between-utterance 

L For those unfamiliar with HAL, it is neither sorry nor afraid, nor is it incapable of opening 
the door. It has simply decided in a fit of paranoia to kill its crew. 
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device common in such conversations. Correctly structuring these such con¬ 
versations requires knowledge of discourse conventions. 

To summarize, the knowledge of language needed to engage in com¬ 
plex language behavior can be separated into six distinct categories. 

• Phonetics and Phonology - The study of linguistic sounds. 

• Morphology - The study of the meaningful components of words. 

• Syntax - The study of the structural relationships between words. 

• Semantics - The study of meaning. 

• Pragmatics - The study of how language is used to accomplish goals. 

• Discourse - The study of linguistic units larger than a single utterance. 


1.2 Ambiguity 

A perhaps surprising fact about the six categories of linguistic knowledge is 
that most or all tasks in speech and language processing can be viewed as 
ambiguity resolving ambiguity at one of these levels. We say some input is ambiguous 
if there are multiple alternative linguistic structures than can be built for it. 
Consider the spoken sentence I made her duck. Here’s five different mean¬ 
ings this sentence could have (there are more) each of which exemplifies an 
ambiguity at some level: 

(1.1) I cooked waterfowl for her. 

(1.2) I cooked waterfowl belonging to her. 

(1.3) I created the (plaster?) duck she owns. 

(1.4) I caused her to quickly lower her head or body. 

(1.5) I waved my magic wand and turned her into undifferentiated 
waterfowl. 

These different meanings are caused by a number of ambiguities. First, the 
words duck and her are morphologically or syntactically ambiguous in their 
part of speech. Duck can be a verb or a noun, while her can be a dative 
pronoun or a possessive pronoun. Second, the word make is semantically 
ambiguous; it can mean create or cook. Finally, the verb make is syntac¬ 
tically ambiguous in a different way. Make can be transitive, i.e. taking a 
single direct object (1.2), or it can be ditransitive, i.e. taking two objects 

(1.5) , meaning that the first object {her) got made into the second object 
{duck). Finally, make can take a direct object and a verb (1.4), meaning that 
the object {her) got caused to perform the verbal action {duck). Furthermore, 
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in a spoken sentence, there is an even deeper kind of ambiguity; the first 
word could have been eye or the second word maid. 

We will often introduce the models and algorithms we present through¬ 
out the book as ways to resolve these ambiguities. For example deciding 
whether duck is a verb or a noun can be solved by part of speech tagging. 
Deciding whether make means ‘create’ or ‘cook’ can be solved by word 
sense disambiguation. Deciding whether her and duck are part of the same 
entity (as in (1.1) or (1.4)) or are different entity (as in (1.2)) can be solved 
by probabilistic parsing. Ambiguities that don’t arise in this particular ex¬ 
ample (like whether a given sentence is a statement or a question) will also 
be resolved, for example by speech act interpretation. 

1.3 Models and Algorithms 

One of the key insights of the last fifty years of research in language pro¬ 
cessing is that the various kinds of knowledge described in the last sections 
can be captured through the use of a small number of formal models, or the¬ 
ories. Fortunately, these models and theories are all drawn from the standard 
toolkits of Computer Science, Mathematics, and Linguistics and should be 
generally familiar to those trained in those fields. Among the most important 
elements in this toolkit are state machines, formal rule systems, logic, as 
well as probability theory and other machine learning tools. These mod¬ 
els, in turn, lend themselves to a small number of algorithms from well- 
known computational paradigms. Among the most important of these are 
state space search algorithms and dynamic programming algorithms. 

In their simplest formulation, state machines are formal models that 
consist of states, transitions among states, and an input representation. Among 
the variations of this basic model that we will consider are deterministic and 
non-deterministic finite-state automata, finite-state transducers, which 
can write to an output device, weighted automata, Markov models and 
hidden Markov models which have a probabilistic component. 

Closely related to these somewhat procedural models are their declar¬ 
ative counterparts: formal rule systems. Among the more important ones we 
will consider are regular grammars and regular relations, context-free 
grammars, feature-augmented grammars, as well as probabilistic vari¬ 
ants of them all. State machines and formal rule systems are the main tools 
used when dealing with knowledge of phonology, morphology, and syntax. 

The algorithms associated with both state-machines and formal rule 
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systems typically involve a search through a space of states representing hy¬ 
potheses about an input. Representative tasks include searching through a 
space of phonological sequences for a likely input word in speech recog¬ 
nition, or searching through a space of trees for the correct syntactic parse 
of an input sentence. Among the algorithms that are often used for these 
tasks are well-known graph algorithms such as depth-first search, as well 
as heuristic variants such as best-first, and A* search. The dynamic pro¬ 
gramming paradigm is critical to the computational tractability of many of 
these approaches by ensuring that redundant computations are avoided. 

The third model that plays a critical role in capturing knowledge of 
language is logic. We will discuss first order logic, also known as the pred¬ 
icate calculus, as well as such related formalisms as feature-structures, se¬ 
mantic networks, and conceptual dependency. These logical representations 
have traditionally been the tool of choice when dealing with knowledge of 
semantics, pragmatics, and discourse (although, as we will see, applications 
in these areas are increasingly relying on the simpler mechanisms used in 
phonology, morphology, and syntax). 

Probability theory is the final element in our set of techniques for cap¬ 
turing linguistic knowledge. Each of the other models (state machines, for¬ 
mal rule systems, and logic) can be augmented with probabilities. One major 
use of probability theory is to solve the many kinds of ambiguity problems 
that we discussed earlier; almost any speech and language processing prob¬ 
lem can be recast as: ‘given N choices for some ambiguous input, choose 
the most probable one’. 

Another major advantage of probabilistic models is that they are one of 
a class of machine learning models. Machine learning research has focused 
on ways to automatically learn the various representations described above; 
automata, rule systems, search heuristics, classifiers. These systems can be 
trained on large corpora and can be used as a powerful modeling technique, 
especially in places where we don’t yet have good causal models. Machine 
learning algorithms will be described throughout the book. 


1.4 Language, Thought, and Understanding 

To many, the ability of computers to process language as skillfully as we do 
will signal the arrival of truly intelligent machines. The basis of this belief is 
the fact that the effective use of language is intertwined with our general cog¬ 
nitive abilities. Among the first to consider the computational implications 
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of this intimate connection was Alan Turing (1950). In this famous paper, 

Turing introduced what has come to be known as the Turing Test. Turing Turing test 
began with the thesis that the question of what it would mean for a machine 
to think was essentially unanswerable due to the inherent imprecision in the 
terms machine and think. Instead, he suggested an empirical test, a game, 
in which a computer’s use of language would form the basis for determin¬ 
ing if it could think. If the machine could win the game it would be judged 
intelligent. 

In Turing’s game, there are three participants: 2 people and a computer. 

One of the people is a contestant and plays the role of an interrogator. To 
win, the interrogator must determine which of the other two participants is 
the machine by asking a series of questions via a teletype. The task of the 
machine is to fool the interrogator into believing it is a person by responding 
as a person would to the interrogator’s questions. The task of the second 
human participant is to convince the interrogator that the other participant is 
the machine, and that they are human. 

The following interaction from Turing’s paper illustrates the kind of 
interactions he had in mind. Clearly a convincing simulation of humanity 
doesn’t require being an expert in all domains: 


Q: Please write me a sonnet on the topic of the Forth Bridge. 

A: Count me out on this one. I never could write poetry 

Q: Add 34957 to 70764 

A: (Pause about 30 seconds and then give answer as) 105621. 

Turing predicted that roughly by the end of the twentieth century a machine 
with 10 gigabytes of memory would have around a 30% chance of fooling a 
human interrogator after 5 minutes of questioning. Given the fact that you 
can fool some of the people all the time, it is not clear how rigorous this 
particular standard is. Nevertheless, the critical issue for Turing was that 
using language as humans do is sufficient, by itself, as an operational test for 
intelligence. 

Surprisingly, many of the issues raised by Turing became relevant as 
early as 1966 in the context of the ELIZA program (Weizenbaum, 1966). 
ELIZA was an early natural language processing system capable of carry¬ 
ing on a limited form of conversation with a user. Consider the following 
session with a version of ELIZA that imitated the responses of a Rogerian 
psychotherapist. 
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Useri: You are like my father in some ways. 

ELIZAi: WHAT RESEMBLANCE DO YOU SEE 

User 2 : You are not very aggressive but I think you don’t want me to notice that. 

ELIZA 2 : WHAT MAKES YOU THINK I AM NOT AGGRESSIVE 

User 3 : You don’t argue with me. 

ELIZA 3 : WHY DO YOU THINK I DON’T ARGUE WITH YOU 

User 4 : You are afraid of me. 

ELIZA 4 : DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU 

As we will discuss in more detail in Chapter 2, ELIZA is a remarkably sim¬ 
ple program that makes use of pattern-matching to process the input and 
translate it into suitable outputs. The success of this simple technique in this 
domain is due to the fact that ELIZA doesn’t actually need to know anything 
to mimic a Rogerian psychotherapist. As Weizenbaum notes, this is one of 
the few dialogue genres where the listener can act as if they know nothing of 
the world. 

ELIZA deep relevance to Turing’s ideas is that many people who in¬ 
teracted with ELIZA came to believe that it really understood them and their 
problems. Indeed, Weizenbaum (1976) notes that many of these people con¬ 
tinued to believe in ELIZA’s abilities even after the program’s operation was 
explained to them. In more recent years, Weizenbaum’s informal reports 
have been repeated in a somewhat more controlled setting. Since 1991, an 
event known as the Loebner Prize competition has attempted to put various 
computer programs to the Turing test. Although these contests have proven 
to have little scientific interest, a consistent result over the years has been 
that even the crudest programs can fool some of the judges some of the time 
(Shieber, 1994). Not surprisingly, these results have done nothing to quell 
the ongoing debate over the suitability of the Turing test as a test for intelli¬ 
gence among philosophers and AI researchers (Searle, 1980). 

Fortunately, for the purposes of this book, the relevance of these results 
does not hinge on whether or not computers will ever be intelligent, or un¬ 
derstand natural language. Far more important is recent related research in 
the social sciences that has confirmed another of Turing’s predictions from 
the same paper. 

Nevertheless I believe that at the end of the century the use of 
words and educated opinion will have altered so much that we 
will be able to speak of machines thinking without expecting to 
be contradicted. 

It is now clear that regardless of what people believe or know about the in- 
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ner workings of computers, they talk about them and interact with them as 
social entities. People act toward computers as if they were people; they are 
polite to them, treat them as team members, and expect among other things 
that computers should be able to understand their needs, and be capable of 
interacting with them naturally. For example, Reeves and Nass (1996) found 
that when a computer asked a human to evaluate how well the computer had 
been doing, the human gives more positive responses than when a different 
computer asks the same questions. People seemed to be afraid of being im¬ 
polite. In a different experiment, Reeves and Nass found that people also 
give computers higher performance ratings if the computer has recently said 
something flattering to the human. Given these predispositions, speech and 
language-based systems may provide many users with the most natural in¬ 
terface for many applications. This fact has led to a long-term focus in the 
field on the design of conversational agents, artificial entities which com¬ 
municate conversationally. 


1.5 The State of the Art and The Near-Term 
Future 

We can only see a short distance ahead, but we can see plenty 

there that needs to be done. 

- Alan Turing. 

This is an exciting time for the field of speech and language processing. 
The recent commercialization of robust speech recognition systems, and the 
rise of the World-Wide Web, have placed speech and language processing 
applications in the spotlight, and have pointed out a plethora of exciting pos¬ 
sible applications. The following scenarios serve to illustrate some current 
applications and near-term possibilities. 

A Canadian computer program accepts daily weather data and gener¬ 
ates weather reports that are passed along unedited to the public in English 
and French (Chandioux, 1976). 

The Babel Fish translation system from Systran handles over 1,000,000 
translation requests a day from the AltaVista search engine site. 

A visitor to Cambridge, Massachusetts, asks a computer about places 
to eat using only spoken language. The system returns relevant information 
from a database of facts about the local restaurant scene (Zue et al , 1991). 

These scenarios represent just a few of applications possible given cur- 
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rent technology. The following, somewhat more speculative scenarios, give 
some feeling for applications currently being explored at research and devel¬ 
opment labs around the world. 

A computer reads hundreds of typed student essays and assigns grades 
to them in a manner that is indistinguishable from human graders (Landauer 
etaL, 1997). 

A satellite operator uses language to ask questions and commands to a 
computer that controls a world-wide network of satellites (?). 

German and Japanese entrepreneurs negotiate a time and place to meet 
in their own languages using small hand-held communication devices (?). 

Closed-captioning is provided in in any of a number of languages for 
a broadcast news program by a computer listening to the audio signal (?). 

A computer equipped with a vision system watches a professional soc¬ 
cer game and provides an automated natural language account of the game 
(?). 

1.6 Some Brief History 

Historically, speech and language processing has been treated very differ¬ 
ently in computer science, electrical engineering, linguistics, and psychol¬ 
ogy/cognitive science. Because of this diversity, speech and language pro¬ 
cessing encompasses a number of different but overlapping fields in these 
different departments: computational linguistics in linguistics, natural lan¬ 
guage processing in computer science, speech recognition in electrical en¬ 
gineering, computational psycholinguistics in psychology. This section 
summarizes the different historical threads which have given rise to the field 
of speech and language processing. This section will provide only a sketch; 
the individual chapters will provide more detail on each area. 

Foundational Insights: 1940’s and 1950’s 

The earliest roots of the field date to the intellectually fertile period just 
after World War II which gave rise to the computer itself. This period 
from the 1940s through the end of the 1950s saw intense work on two 
foundational paradigms: the automaton and probabilistic or information- 
theoretic models. 

The automaton arose in the 1950s out of Turing’s (1950) model of 
algorithmic computation, considered by many to be the foundation of mod- 
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em computer science. Turing’s work led to the McCulloch-Pitts neuron 
(McCulloch and Pitts, 1943), a simplified model of the neuron as a kind of 
computing element that could be described in terms of propositional logic, 
and then to the work of Kleene (1951) and (1956) on finite automata and reg¬ 
ular expressions. Automata theory was contributed to by Shannon (1948), 
who applied probabilistic models of discrete Markov processes to automata 
for language. Drawing the idea of a finite-state Markov process from Shan¬ 
non’s work, Chomsky (1956) first considered finite-state machines as a way 
to characterize a grammar, and defined a finite-state language as a language 
generated by a finite-state grammar. These early models led to the field of 
formal language theory, which used algebra and set theory to define formal 
languages as sequences of symbols. This includes the context-free grammar, 
first defined by Chomsky (1956) for natural languages but independently dis¬ 
covered by Backus (1959) and Naur et al (1960) in their descriptions of the 
ALGOL programming language. 

The second foundational insight of this period was the development of 
probabilistic algorithms for speech and language processing, which dates to 
Shannon’s other contribution: the metaphor of the noisy channel and de¬ 
coding for the transmission of language through media like communication 
channels and speech acoustics. Shannon also borrowed the concept of en¬ 
tropy from thermodynamics as a way of measuring the information capacity 
of a channel, or the information content of a language, and performed the 
first measure of the entropy of English using probabilistic techniques. 

It was also during this early period that the sound spectrograph was 
developed (Koenig et al , 1946), and foundational research was done in in¬ 
strumental phonetics that laid the groundwork for later work in speech recog¬ 
nition. This led to the first machine speech recognizers in the early 1950’s. 
In 1952, researchers at Bell Labs built a statistical system that could rec¬ 
ognize any of the 10 digits from a single speaker (Davis et al , 1952). The 
system had 10 speaker-dependent stored patterns roughly representing the 
first two vowel formants in the digits. They achieved 97-99% accuracy by 
choosing the pattern which had the highest relative correlation coefficient 
with the input. 

The Two Camps: 1957-1970 

By the end of the 1950s and the early 1960s, speech and language processing 
had split very cleanly into two paradigms: symbolic and stochastic. 

The symbolic paradigm took off from two lines of research. The first 
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was the work of Chomsky and others on formal language theory and gen¬ 
erative syntax throughout the late 1950’s and early to mid 1960’s, and the 
work of many linguistics and computer scientists on parsing algorithms, ini¬ 
tially top-down and bottom-up, and then via dynamic programming. One 
of the earliest complete parsing systems was Zelig Harris’s Transformations 
and Discourse Analysis Project (TDAP), which was implemented between 
June 1958 and July 1959 at the University of Pennsylvania (Harris, 1962). 2 
The second line of research was the new field of artificial intelligence. In 
the summer of 1956 John McCarthy, Marvin Minsky, Claude Shannon, and 
Nathaniel Rochester brought together a group of researchers for a two month 
workshop on what they decided to call artificial intelligence. Although AI al¬ 
ways included a minority of researchers focusing on stochastic and statistical 
algorithms (include probabilistic models and neural nets), the major focus of 
the new field was the work on reasoning and logic typified by Newell and 
Simon’s work on the Logic Theorist and the General Problem Solver. At this 
point early natural language understanding systems were built, These were 
simple systems which worked in single domains mainly by a combination 
of pattern matching and key-word search with simple heuristics for reason¬ 
ing and question-answering. By the late 1960’s more formal logical systems 
were developed. 

The stochastic paradigm took hold mainly in departments of statistics 
and of electrical engineering. By the late 1950’s the Bayesian method was 
beginning to be applied to to the problem of optical character recognition. 
Bledsoe and Browning (1959) built a Bayesian system for text-recognition 
that used a large dictionary and computed the likelihood of each observed let¬ 
ter sequence given each word in the dictionary by multiplying the likelihoods 
for each letter. Mosteller and Wallace (1964) applied Bayesian methods to 
the problem of authorship attribution on The Federalist papers. 

The 1960s also saw the rise of the first serious testable psychological 
models of human language processing based on transformational grammar, 
as well as the first online corpora: the Brown corpus of American English, 
a 1 million word collection of samples from 500 written texts from different 
genres (newspaper, novels, non-fiction, academic, etc.), which was assem¬ 
bled at Brown University in 1963-64 (Kucera and Francis, 1967; Francis, 
1979; Francis and Kucera, 1982), and William S. Y. Wang’s 1967 DOC (Dic- 


2 This system was reimplemented recently and is described by Joshi and Hopely (1999) 
and Karttunen (1999), who note that the parser was essentially implemented as a cascade of 
finite-state transducer. 



Section 1.6. Some Brief History 


13 


tionary on Computer), an on-line Chinese dialect dictionary. 

Four Paradigms: 1970-1983 

The next period saw an explosion in research in speech and language pro¬ 
cessing, and the development of a number of research paradigms which still 
dominate the field. 

The stochastic paradigm played a huge role in the development of 
speech recognition algorithms in this period, particularly the use of the Hid¬ 
den Markov Model and the metaphors of the noisy channel and decoding, 
developed independently by Jelinek, Bahl, Mercer, and colleagues at IBM’s 
Thomas J. Watson Research Center, and Baker at Carnegie Mellon Univer¬ 
sity, who was influenced by the work of Baum and colleagues at the Institute 
for Defense Analyses in Princeton. AT&T’s Bell Laboratories was also a 
center for work on speech recognition and synthesis; see (Rabiner and Juang, 
1993) for descriptions of the wide range of this work. 

The logic-based paradigm was begun by the work of Colmerauer and 
his colleagues on Q-systems and metamorphosis grammars (Colmerauer, 
1970,1975), the forerunners of Prolog and Definite Clause Grammars (Pereira 
and Warren, 1980). Independently, Kay’s (1979) work on functional gram¬ 
mar, and shortly later, (1982)’s (1982) work on LFG, established the impor¬ 
tance of feature structure unification. 

The natural language understanding field took off during this period, 
beginning with Terry Winograd’s SHRDLU system which simulated a robot 
embedded in a world of toy blocks (Winograd, 1972a). The program was 
able to accept natural language text commands (Move the red block on top 
of the smaller green one) of a hitherto unseen complexity and sophistication. 
His system was also the first to attempt to build an extensive (for the time) 
grammar of English, based on Halliday’s systemic grammar. Winograd’s 
model made it clear that the problem of parsing was well-enough understood 
to begin to focus on semantics and discourse models. Roger Schank and his 
colleagues and students (in was often referred to as the Yale School) built a 
series of language understanding programs that focused on human concep¬ 
tual knowledge such as scripts, plans and goals, and human memory organi¬ 
zation (Schank and Abelson, 1977; Schank and Riesbeck, 1981; Cullingford, 
1981; Wilensky, 1983; Lehnert, 1977). This work often used network-based 
semantics (Quillian, 1968; Norman and Rumelhart, 1975; Schank, 1972; 
Wilks, 1975c, 1975b; Kintsch, 1974) and began to incorporate Fillmore’s 
notion of case roles (Fillmore, 1968) into their representations (Simmons, 
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1973a). 

The logic-based and natural-language understanding paradigms were 
unified on systems that used predicate logic as a semantic representation, 
such as the LUNAR question-answering system (Woods, 1967, 1973). 

The discourse modeling paradigm focused on four key areas in dis¬ 
course. Grosz and her colleagues proposed ideas of discourse structure and 
discourse focus (Grosz, 1977a; Sidner, 1983a), a number of researchers be¬ 
gan to work on automatic reference resolution (Hobbs, 1978a), and the BDI 
(Belief-Desire-Intention) framework for logic-based work on speech acts 
was developed (Perrault and Allen, 1980; Cohen and Perrault, 1979). 

Empiricism and Finite State Models Redux: 1983-1993 

This next decade saw the return of two classes of models which had lost 
popularity in the late 50’s and early 60’s, partially due to theoretical argu¬ 
ments against them such as Chomsky’s influential review of Skinner’s Verbal 
Behavior (Chomsky, 1959b). The first class was finite-state models, which 
began to receive attention again after work on finite-state phonology and 
morphology by (Kaplan and Kay, 1981) and finite-state models of syntax by 
Church (1980). A large body of work on finite-state models will be described 
throughout the book. 

The second trend in this period was what has been called the ‘return of 
empiricism’; most notably here was the rise of probabilistic models through¬ 
out speech and language processing, influenced strongly by the work at the 
IBM Thomas J. Watson Research Center on probabilistic models of speech 
recognition. These probabilistic methods and other such data-driven ap¬ 
proaches spread into part of speech tagging, parsing and attachment ambi¬ 
guities, and connectionist approaches from speech recognition to semantics. 

This period also saw considerable work on natural language genera¬ 
tion. 

The Field Comes Together: 1994-1999 

By the last five years of the millennium it was clear that the field was vastly 
changing. First, probabilistic and data-driven models had become quite stan¬ 
dard throughout natural language processing. Algorithms for parsing, part 
of speech tagging, reference resolution, and discourse processing all began 
to incorporate probabilities, and employ evaluation methodologies borrowed 
from speech recognition and information retrieval. Second, the increases in 
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the speed and memory of computers had allowed commercial exploitation 
of a number of subareas of speech and language processing, in particular 
speech recognition and spelling and grammar checking. Finally, the rise of 
the Web emphasized the need for language-based information retrieval and 
information extraction. 

A Final Brief Note on Psychology 

Many of the chapters in this book include short summaries of psychological 
research on human processing. Of course, understanding human language 
processing is an important scientific goal in its own right, and is part of the 
general field of cognitive science. However, an understanding of human 
language processing can often be helpful in building better machine mod¬ 
els of language. This seems contrary to the popular wisdom, which holds 
that direct mimicry of nature’s algorithms is rarely useful in engineering ap¬ 
plications. For example the argument is often made that if we copied nature 
exactly, airplanes would flap their wings; yet airplanes with fixed wings are a 
more successful engineering solution. But language is not aeronautics. Crib¬ 
bing from nature is sometimes useful for aeronautics (after all, airplanes do 
have wings), but it is particularly useful when we are trying to solve human- 
centered tasks. Airplane flight has different goals than bird flight; but the 
goal of speech recognition systems, for example, is to perform exactly the 
task that human court reporters perform every day: transcribe spoken dialog. 
Since people already do this well, we can learn from nature’s previous solu¬ 
tion. Since we are building speech recognition systems in order to interact 
with people, it makes sense to copy a solution that behaves the way people 
are accustomed to. 

1.7 Summary 

This chapter introduces the field of speech and language processing. The 
following are some of the highlights of this chapter. 

• A good way to understand the concerns of speech and language pro¬ 
cessing research is to consider what it would take to create an intelli¬ 
gent agent like HAL from 2001: A Space Odyssey. 

• Speech and language technology relies on formal models, or represen¬ 
tations, of knowledge of language at the levels of phonology and pho¬ 
netics, morphology, syntax, semantics, pragmatics and discourse. A 



16 


Chapter 1. Introduction 


small number of formal models including state machines, formal rule 
systems, logic, and probability theory are used to capture this knowl¬ 
edge. 

• The foundations of speech and language technology lie in computer 
science, linguistics, mathematics, electrical engineering and psychol¬ 
ogy. A small number of algorithms from standard frameworks are used 
throughout speech and language processing, 

• The critical connection between language and thought has placed speech 
and language processing technology at the center of debate over intel¬ 
ligent machines. Furthermore, research on how people interact with 
complex media indicates that speech and language processing technol¬ 
ogy will be critical in the development of future technologies. 

• Revolutionary applications of speech and language processing are cur¬ 
rently in use around the world. Recent advances in speech recognition 
and the creation of the World-Wide Web will lead to many more appli¬ 
cations. 


Bibliographical and Historical Notes 

Research in the various subareas of speech and language processing is spread 
across a wide number of conference proceedings and journals. The con¬ 
ferences and journals most centrally concerned with computational linguis¬ 
tics and natural language processing are associated with the Association for 
Computational Linguistics (ACL), its European counterpart (EACL), and the 
International Conference on Computational Linguistics (COLING). The an¬ 
nual proceedings of ACL and EACL, and the biennial COLING conference 
are the primary forums for work in this area. Related conferences include 
the biennial conference on Applied Natural Language Processing (ANLP) 
and the conference on Empirical Methods in Natural Language Processing 
(EMNLaP). The journal Computational Linguistics is the premier publica¬ 
tion in the field, although it has a decidedly theoretical and linguistic ori¬ 
entation. The journal Natural Language Engineering covers more practical 
applications of speech and language research. 

Research on speech recognition, understanding, and synthesis is pre¬ 
sented at the biennial International Conference on Spoken Language Pro¬ 
cessing (ICSLP) which alternates with the European Conference on Speech 
Communication and Technology (EUROSPEECH). The IEEE International 
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Conference on Acoustics, Speech, & Signal Processing (IEEE ICASSP) 
is held annually, as is the meeting of the Acoustical Society of America. 
Speech journals include Speech Communication , Computer Speech and Lan¬ 
guage, and IEEE Transactions on Pattern Analysis and Machine Intelli¬ 
gence. 

Work on language processing from an Artificial Intelligence perspec¬ 
tive can be found in the annual meetings of the American Association for Ar¬ 
tificial Intelligence (AAAI), as well as the biennial International Joint Con¬ 
ference on Artificial Intelligence (IJCAI) meetings. The following artificial 
intelligence publications periodically feature work on speech and language 
processing: Artificial Intelligence , Computational Intelligence , IEEE Trans¬ 
actions on Intelligent Systems , and the Journal of Artificial Intelligence Re¬ 
search. Work on cognitive modeling of language can be found at the annual 
meeting of the Cognitive Science Society, as well as its journal Cognitive 
Science. An influential series of closed workshops was held by ARPA, called 
variously the DARPA Speech and Natural Language Processing Workshop or 
the ARPA Workshop on Human Language Technology. 

The are a fair number of textbooks available covering various aspects 
of speech and language processing. (Manning and Schutze, 1999) ( Foun¬ 
dations of Statistical Language Processing) focuses on statistical models of 
tagging, parsing, disambiguation, collocations, and other areas. Chamiak 
(1993) (Statistical Language Learning ) is an accessible, though less exten¬ 
sive, introduction to similar material. Allen (1995) (Natural Language Un¬ 
derstanding) provides extensive coverage of language processing from the 
AI perspective. (Gazdar and Mellish, 1989) (Natural Language Process¬ 
ing in Lisp/Prolog) covers especially automata, parsing, features, and uni¬ 
fication. (Pereira and Shieber, 1987) gives a Prolog-based introduction to 
parsing and interpretation. Russell and Norvig (1995) is an introduction to 
artificial intelligence that includes chapters on natural language processing. 
Partee (1990) has a very broad coverage of mathematical linguistics. (Cole, 
1997) is a volume of survey papers covering the entire field of speech and 
language processing. A somewhat dated but still tremendously useful col¬ 
lection of foundational papers can be found in (Grosz et al. , 1986) (Readings 
in Natural Language Processing). 

Of course, a wide-variety of speech and language processing resources 
are now available on the World-Wide Web. Pointers to these resources are 
maintained on the homepage for this book at www.cs.colorado.edu/ mar- 
tin/slp.html. 




Part I 

WORDS 


Words are the fundamental building block of language. Every human 
language, spoken, signed, or written, is composed of words. Every 
area of speech and language processing, from speech recognition to 
machine translation to information retrieval on the web, requires ex¬ 
tensive knowledge about words. Psycholinguistic models of human 
language processing and models from generative linguistic are also 
heavily based on lexical knowledge. 

The six chapters in this part introduce computational models 
of the spelling, pronunciation, and morphology of words and cover 
three important real-world tasks that rely on lexical knowledge: au¬ 
tomatic speech recognition (ASR), text-to-speech synthesis (TTS), 
and spell-checking. Finally, these chapters define perhaps the most 
important computational model for of speech and language process¬ 
ing: the automaton. Four kinds of automata are covered: finite- 
state automata (FSAs) and regular expressions, finite-state transducers 
(FSTs), weighted transducers, and the Hidden Markov Model (HMM), 
as well as the A-gram model of word sequences. 









REGULAR EXPRESSIONS 
AND AUTOMATA 


“In the old days , if you wanted to impeach a witness you had to 
go back and fumble through endless transcripts . Now it's on a 
screen somewhere or on a disk and I can search for a particular 
word - say every time the witness used the word glove - and then 
quickly ask a question about what he said years ago . Right away 
you see the witness get flustered." 

Johnnie L. Cochran Jr., attorney, New York Times , 9/28/97 


Imagine that you have become a passionate fan of woodchucks. De¬ 
siring more information on this celebrated woodland creature, you turn to 
your favorite web browser and type in woodchuck. Your browser returns a 
few sites. You have a flash of inspiration and type in woodchucks. This time 
you discover ‘interesting links to woodchucks and lemurs’ and ‘all about 
Vermont’s unique, endangered species’. Instead of having to do this search 
twice, you would have rather typed one search command specifying some¬ 
thing like woodchuck with an optional final s. Furthermore, you might want 
to find a site whether or not it spelled woodchucks with a capital W (Wood¬ 
chuck). Or perhaps you might want to search for all the prices in some docu¬ 
ment; you might want to see all strings that look like $199 or $25 or $24.99. 
In this chapter we introduce the regular expression, the standard notation 
for characterizing text sequences. The regular expression is used for spec¬ 
ifying text strings in situations like this web-search example, and in other 
information retrieval applications, but also plays an important role in word¬ 
processing (in PC, Mac, or UNIX applications), computation of frequencies 
from corpora, and other such tasks. 
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After we have defined regular expressions, we show how they can be 
implemented via the finite-state automaton. The finite-state automaton is 
not only the mathematical device used to implement regular expressions, but 
also one of the most significant tools of computational linguistics. Variations 
of automata such as finite-state transducers, Hidden Markov Models, and 
V-gram grammars are important components of the speech recognition and 
synthesis, spell-checking, and information-extraction applications that we 
will introduce in later chapters. 


2.1 Regular Expressions 

SIR ANDREW Her C’s, her U’s and her T’s: why that? 

Shakespeare, Twelfth Night 

One of the unsung successes in standardization in computer science 
expression has been the regular expression (RE), a language for specifying text search 
strings. The regular expression languages used for searching texts in UNIX 
(vi, Perl, Emacs, grep), Microsoft Word (version 6 and beyond), and Word¬ 
Perfect are almost identical, and many RE features exist in the various Web 
search engines. Besides this practical use, the regular expression is an im¬ 
portant theoretical tool throughout computer science and linguistics. 

A regular expression (first developed by Kleene (1956) but see the His¬ 
tory section for more details) is a formula in a special language that is used 
strings for specifying simple classes of strings. A string is a sequence of symbols; 

for the purpose of most text-based search techniques, a string is any sequence 
of alphanumeric characters (letters, numbers, spaces, tabs, and punctuation). 
For these purposes a space is just a character like any other, and we represent 
it with the symbol 

Formally, a regular expression is an algebraic notation for characteriz¬ 
ing a set of strings. Thus they can be used to specify search strings as well as 
to define a language in a formal way. We will begin by talking about regular 
expressions as a way of specifying searches in texts, and proceed to other 
uses. Section 2.3 shows that the use of just three regular expression opera¬ 
tors is sufficient to characterize strings, but we use the more convenient and 
commonly-used regular expression syntax of the Perl language throughout 
this section. Since common text-processing programs agree on most of the 
syntax of regular expressions, most of what we say extends to all UNIX, Mi¬ 
crosoft Word, and WordPerfect regular expressions. Appendix A shows the 
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few areas where these programs differ from the Perl syntax. 

Regular expression search requires a pattern that we want to search 
for, and a corpus of texts to search through. A regular expression search corpus 
function will search through the corpus returning all texts that contain the 
pattern. In an information retrieval (IR) system such as a web search engine, 
the texts might be entire documents or web pages. In a word-processor, the 
texts might be individual words, or lines of a document. In the rest of this 
chapter, we will use this last paradigm. Thus when we give a search pattern, 
we will assume that the search engine returns the line of the document re¬ 
turned. This is what the UNIX ‘grep’ command does. We will underline the 
exact part of the pattern that matches the regular expression. A search can be 
designed to return all matches to a regular expression or only the first match. 

We will show only the first match. 

Basic Regular Expression Patterns 

The simplest kind of regular expression is a sequence of simple characters. 

For example, to search for woodchuck , we type /woodchuck/. So the reg¬ 
ular expression /Buttercup / matches any string containing the substring 
Buttercup , for example the line Vm called little Buttercup) (recall that we 
are assuming a search application that returns entire lines). From here on 
we will put slashes around each regular expression to make it clear what is 
a regular expression and what is a pattern. We use the slash since this is the 
notation used by Perl, but the slashes are not part of the regular expressions. 

The search string can consist of a single letter (like / ! /) or a sequence 
of letters (like /urgl/); The first instance of each match to the regular ex¬ 
pression is underlined below (although a given application might choose to 
return more than just the first instance): 


RE 

Example Patterns Matched 

/woodchucks/ 

/a/ 

/Claire^says,/ 
/song/ 

/!/ 

“interesting links to woodchucks and lemurs” 
“Mary Ann stopped by Mona’s” 

“Dagmar, my gift please,” Claire says,” 

“all our pretty songs” 

“You’ve left the burglar behind again!” said Nori 


Regular expressions are case sensitive; lower-case / s / is distinct from 
upper-case / S /; ( / s / matches a lower case s but not an upper-case S ). This 
means that the pattern /woodchucks/ will not match the string Wood¬ 
chucks. We can solve this problem with the use of the square braces [ and ]. 
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The string of characters inside the braces specify a disjunction of characters 
to match. For example Figure 2.1 shows that the pattern / [wW] / matches 
patterns containing either w or W. 


RE 

Match 

Example Patterns 

/[wW]oodchuck/ 
/ [abc] / 

/ [1234567890]/ 

Woodchuck or woodchuck 
‘a’, ‘b’, or ‘c’ 
any digit 

“Woodchuck” 

“In uomini, in soldati” 
“plenty of 7 to 5” 

Figure 2.1 The use of the brackets [] to specify a disjunction of characters. 


The regular expression / [1234567890] / specified any single digit. 
While classes of characters like digits or letters are important building blocks 
in expressions, they can get awkward (e.g. it’s inconvenient to specify 

/[ABCDEFGHI JKLMNOPQRSTUVWXYZ]/ 

to mean ‘any capital letter’). In these cases the brackets can be used with 
range the dash (-) to specify any one character in a range. The pattern / [2- 
5 ] / specifies any one of the characters 2, 3, 4, or 5 . The pattern / [b-g] / 
specifies one of the characters b , c, d , e,/, or g. Some other examples: 


RE 

Match 

Example Patterns Matched 

/ [A-Z] / 
/ [a-z] / 
/ [0-9] / 

an uppercase letter 
a lowercase letter 
a single digit 

“we should call it ‘Drenched Blossoms’” 
“my beans were impatient to be hoed!” 
“Chapter 1: Down the Rabbit Hole” 

Figure 2.2 The use of the brackets [ ] plus the dash - to specify a range. 


The square braces can also be used to specify what a single charac¬ 
ter cannot be, by use of the caret ~. If the caret ~ is the first symbol after 
the open square brace [, the resulting pattern is negated. For example, the 
pattern / [ A a ] / matches any single character (including special characters) 
except a. This is only true when the caret is the first symbol after the open 
square brace. If it occurs anywhere else, it usually stands for a caret; Fig¬ 
ure 2.3 shows some examples. 

The use of square braces solves our capitalization problem for wood¬ 
chucks. But we still haven’t answered our original question; how do we 
specify both woodchuck and woodchucksl We can’t use the square brack¬ 
ets, because while they allow us to say ‘s or S’, they don’t allow us to say 
‘s or nothing’. For this we use the question-mark /?/, which means ‘the 
preceding character or nothing’, as shown in Figure 2.4. 
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RE 

Match (single characters) 

Example Patterns Matched 

TA-Z] 

rss] 

r\.i 

[e ~ ] 
a'b 

not an uppercase letter 
neither ‘S’ nor ‘s’ 
not a period 
either ‘e’ or ‘ ~ ’ 
the pattern ‘a A b’ 

“Oyfn pripetchik” 

“I have no exquisite reason for’t” 
“our resident Djinn” 

“look up 1 now” 

“look up a" b now” 

Figure 2.3 Uses of the caret " for negation or just to mean " 


RE 


Match 

Example Patterns Matched 

woodchucks ? 

woodchuck or woodchucks 

“woodchuck” 

colou?r 


color or colour 

“colour” 

Figure 2.4 

sion. 

The question-mark ? marks optionality of the previous expres- 


We can think of the question-mark as meaning ‘zero or one instances 
of the previous character’. That is, it’s a way of specifying how many of 
something that we want. So far we haven’t needn’t to specify that we want 
more than one of something. But sometimes we need regular expressions 
that allow repetitions of things. For example, consider the language of (cer¬ 
tain) sheep, which consists of strings that look like the following: 

baa! 

baaa! 

baaaa! 

baaaaa! 

baaaaaa! 


This language consists of strings with a b , followed by at least 2 a' s, 
followed by an exclamation point. The set of operators that allow us to say 
things like “some number of ‘a’s” are based on the asterisk or *, commonly 
called the Kleene * (pronounced “cleany star”). The Kleene star means kleene* 
‘zero or more occurrences of the immediately previous character or regular 
expression’. So /a*/ means ‘any string of zero or more a’s’. This will 
match a or aaaaaa but it will also match Off Minor, since the string Off 
Minor has zero a’s. So the regular expression for matching one or more 
a is /aa*/, meaning one a followed by zero or more a' s. More complex 
patterns can also be repeated. So / [ ab ] * / means ‘zero or more ‘a’s or ‘b’s’ 

(not ‘zero or more right square braces). This will match strings like aaaa or 
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KLEENE + 


ANCHORS 


ababab or bbbb. 

We now know enough to specify part of our regular expression for 
prices: multiple digits. Recall that the regular expression for an individual 
digit was / [ 0 - 9 ] /. So the regular expression for an integer (a string of 
digits) is / [0-9] [0-9] */. (Why isn’t it just / [0-9] */)? 

Sometimes it’s annoying to have to write the regular expression for dig¬ 
its twice, so there is a shorter way to specify ‘at least one’ of some character. 
This is the Kleene +, which means ‘one or more of the previous character’. 
Thus the expression / [ 0 - 9 ] + / is the normal way to specify ‘a sequence of 
digits’. There are thus two ways to specify the sheep language: /baaa* ! / 
or /baa+ ! /. 

One very important special character is the period (/ . /, a wildcard 
expression that matches any single character {except a carriage return): 


RE 

Match 

Example Patterns 

/beg.n/ 

any character between ‘beg’ and ‘n’ 

begin, beg’n, begun 

Figure 2.5 

The use of the period . to specify any character. 


The wildcard is often used together with the Kleene star to mean ‘any 
string of characters’. For example suppose we want to find any line in which 
a particular word, for example aardvark , appears twice. We can specify this 
with the regular expression /aardvark. ^aardvark/. 

Anchors are special characters that anchor regular expressions to par¬ 
ticular places in a string. The most common anchors are the caret " and the 
dollar-sign $. The caret " matches the start of a line. The pattern /" The/ 
matches the word The only at the start of a line. Thus there are three uses 
of the caret ": to match the start of a line, as a negation inside of square 
brackets, and just to mean a caret. (What are the contexts that allow Perl to 
know which function a given caret is supposed to have?). The dollar sign $ 
matches the end of a line. So the pattern is a useful pattern for matching 
a space at the end of a line, and / "The dog\ . $ / matches a line that con¬ 
tains only the phrase The dog. (We have to use the backslash here since we 
want the . to mean ‘period’ and not the wildcard). 

There are also two other anchors: \b matches a word boundary, while 
\B matches a non-boundary. Thus / \bthe\b/ matches the word the but 
not the word other. More technically, Perl defines a word as any sequence 
of digits, underscores or letters; this is based on the definition of ‘words’ in 
programming languages like Perl or C. For example, /\b9 9/ will match 
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the string 99 in There are 99 bottles of beer on the wall (because 99 follows 
a space) but not 99 in There are 299 bottles of beer on the wall (since 99 
follows a number). But it will match 99 in $99 (since 99 follows a dollar 
sign ($), which is not a digit, underscore, or letter). 

Disjunction, Grouping, and Precedence 

Suppose we need to search for texts about pets; perhaps we are particularly 
interested in cats and dogs. In such a case we might want to search for either 
the string cat or the string dog. Since we can’t use the square-brackets to 
search for ‘cat or dog’ (why not?) we need a new operator, the disjunction disjunction 
operator, also called the pipe symbol |. The pattern /cat | dog/ matches 
either the string cat or the string dog. 

Sometimes we need to use this disjunction operator in the midst of 
a larger sequence. For example, suppose I want to search for information 
about pet fish for my cousin David. How can I specify both guppy and 
guppiesl We cannot simply say /guppy lies/, because that would match 
only the strings guppy and ies. This is because sequences like guppy take 
precedence over the disjunction operator |. In order to make the disjunction precedence 
operator apply only to a specific pattern, we need to use the parenthesis 
operators ( and ). Enclosing a pattern in parentheses makes it act like a 
single character for the purposes of neighboring operators like the pipe | 
and the Kleene*. So the pattern /gupp (y | ies) / would specify that we 
meant the disjunction only to apply to the suffixes y and ies. 

The parenthesis operator ( is also useful when we are using counters 
like the Kleene*. Unlike the | operator, the Kleene* operator applies by 
default only to a single character, not a whole sequence. Suppose we want 
to match repeated instances of a string. Perhaps we have a line that has 
column labels of the form Column 1 Column 2 Column 3. The expression 
/Column^ [ 0 - 9 ] +,_,* / will not match any column; instead, it will match 
a column followed by any number of spaces! The star here applies only to 
the space that precedes it, not the whole sequence. With the parentheses, 
we could write the expression / (Column^ [ 0-9] +,_,*) */ to match the 
word Column , followed by a number and optional spaces, the whole pattern 
repeated any number of times. 

This idea that one operator may take precedence over another, requir¬ 
ing us to sometimes use parentheses to specify what we mean, is formalized 
by the operator precedence hierarchy for regular expressions. The follow- precedence 
ing table gives the order of RE operator precedence, from highest precedence 
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to lowest precedence: 

Parenthesis () 

Counters * + ? { } 

Sequences and anchors the "my endS 
Disjunction | 

Thus, because counters have a higher precedence than sequences, / 1 he * / 
matches theeeee but not thethe. Because sequences have a higher precedence 
than disjunction, /the | any / matches the or any but not theny. 

Patterns can be ambiguous in another way. Consider the expression 
/ [ a-z ] * / when matching against the text once upon a time. Since / [ a- 
z] */ matches zero or more letters, this expression could match nothing, or 
just the first letter < 9 , or on , or one , or once. In these cases regular expressions 
greedy always match the largest string they can; we say that patterns are greedy, 
expanding to cover as much of a string as they can. 

A simple example 

Suppose we wanted to write a RE to find cases of the English article the. A 
simple (but incorrect) pattern might be: 

/the/ 

One problem is that this pattern will miss the word when it begins 
a sentence and hence is capitalized (i.e. The). This might lead us to the 
following pattern: 

/ [tT]he/ 

But we will still incorrectly return texts with the embedded in other 
words (e.g. other or theology ). So we need to specify that we want instances 
with a word boundary on both sides: 

Ab [tT] he\b/ 

Suppose we wanted to do this without the use of /\b/? We might 
want this since / \ b/ won’t treat underscores and numbers as word bound¬ 
aries; but we might want to find the in some context where it might also have 
underlines or numbers nearby (the- or the25). We need to specify that we 
want instances in which there are no alphabetic letters on either side of the 
the: 


/["a-z] [tT]he [ "a-z]/ 
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But there is still one more problem with this pattern: it won’t find the 
word the when it begins a line. This is because the regular expression [ a- 
z ], which we used to avoid embedded the s, implies that there must be some 
single (although non-alphabetic) character before the the. We can avoid 
this by specifying that before the the we require either the beginning-of-line 
or a non-alphabetic character: 

/ r I Ta-z]) [ tT] he [ "a-z ] / 

A More Complex Example 

Let’s try out a more significant example of the power of REs. Suppose we 
want to build an application to help a user buy a computer on the web. The 
user might want ‘any PC with more than 500 Mhz and 32 Gb of disk space 
for less than $1000’. In order to do this kind of retrieval we will first need to 
be able to look for expressions like 500 MHz or 3.5 Gb or 32 Megabytes , or 
Compaq or Mac or $999.99. In the rest of this section we’ll work out some 
simple regular expressions for this task. 

First, let’s complete our regular expression for prices. Here’s a regular 
expression for a dollar sign followed by a string of digits. Note that Perl is 
smart enough to realize that $ here doesn’t mean end-of-line; how might it 
know that? 

/$ [0-9]+/ 

Now we just need to deal with fractions of dollars. We’ll add a decimal 
point and two digits afterwards: 

/$[0-9]+\. [0-9] [0-9]/ 

This pattern only allows $199.99 but not $199. We need to make the 
cents optional, and make sure we’re at a word boundary: 

Ab$ [0-9] + (\ . [0-9] [0-9] ) ?\b/ 

How about specifications for processor speed (in Megahertz = Mhz or 
Gigahertz = Ghz)? Here’s a pattern for that: 

/\b [0-9]+,_,* (Mhz | [Mm] egahertz | Ghz | [Gg] igahertz) \b/ 

Note that we use /,_,*/ to mean ’zero or more spaces’, since there 
might always be extra spaces lying around. Dealing with disk space (in Gb 
= gigabytes), or memory size (in Mb = megabytes or Gb = gigabytes), we 
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need to allow for optional gigabyte fractions again (5.5 Gb). Note the use of 
? for making the final s optional: 

/\b [0-9] (Mb | [Mm] egabytes?) \b/ 

Ab [0-9] ( \ . [0-9] + )?,_* (Gb | [Gg] igabytes?) \b/ 

Finally, we might want some simple patterns to specify operating sys¬ 
tems and vendors: 

Ab (Win |Win95 |Win98 | WinNT | Windows^* (NT | 95 | 98) ?) \b/ 
Ab (Mac |Macintosh | Apple) \b/ 

Advanced Operators 


RE 

Expansion 

Match 

Example Patterns 

\d 

[0-9] 

any digit 

P ar ty l _ 1 of l _ 1 5 

\D 

["0-9] 

any non-digit 

Blue^moon 

\w 

[a-zA-Z0-9 l _ 1 ] 

any alphanumeric or space 

Daiyu 

\w 

[ "\w] 

a non-alphanumeric 

!!!! 

\s 

[^XrXtXnXf] 

whitespace (space, tab) 


\s 

r\s] 

Non-whitespace 

in^Concord 

Figure 2.6 Aliases for common sets of characters. 


There are also some useful advanced regular expression operators. Fig¬ 
ure 2.6 shows some useful aliases for common ranges, which can be used 
mainly to save typing. Besides the Kleene * and Kleene +, we can also use 
explicit numbers as counters, by enclosing them in curly brackets. The reg¬ 
ular expression / { 3 } / means “exactly 3 occurrences of the previous char¬ 
acter or expression”. So /a\. { 2 4 } z / will match a followed by 24 dots 
followed by z (but not a followed by 23 or 25 dots followed by a z). 

A range of numbers can also be specified; so / {n, m} / specifies from 
n to m occurrences of the previous char or expression, while / {n, } / means 
at least n occurrences of the previous expression. REs for counting are sum¬ 
marized in Figure 2.7. 

Finally, certain special characters are referred to by special notation 
newline based on the backslash (\). The most common of these are the newline 
character \n and the tab character \ t. To refer to characters that are special 
themselves, (like ., [, and \), precede them with a backslash, (i.e. / \ . /, 

A*/, A [ /, and A\/). 
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RE 

Match 

* 

+ 

7 

{n} 
{n,m} 
{n, } 

zero or more occurrences of the previous char or expression 
one or more occurrences of the previous char or expression 
exactly zero or one occurrence of the previous char or expression 
n occurrences of the previous char or expression 
from n to m occurrences of the previous char or expression 
at least n occurrences of the previous char or expression 

Figure 2.7 Regular expression operators for counting. 


RE 

Match 

Example Patterns Matched 

\* 

\. 

\? 

\n 

\t 

an asterisk 
a period “.” 
a question mark 
a newline 
a tab 

“K*A*P*L*A*N” 

“Dr. Livingston, I presume” 
“Would you light my candle?” 

Figure 2.8 Some characters that need to be backslashed. 


The reader should consult Appendix A for further details of regular 
expressions, and especially for the differences between regular expressions 
in Perl, UNIX, and Microsoft Word. 

Regular Expression Substitution, Memory, and ELIZA 

An important use of regular expressions is in substitutions. For example, the fio B N S s TITU 
Perl substitution operator s/regexpl/regexp2/ allows a string charac¬ 
terized by one regular expression to be replaced by a string characterized by 
a different regular expression: 

s/colour/color/ 

It is often useful to be able to refer to a particular subpart of the string 
matching the first pattern. For example, suppose we wanted to put angle 
brackets around all integers in a text, changing e.g. the 35 boxes to the 
<35> boxes. We’d like a way to refer back to the integer we’ve found so 
that we can easily add the brackets. To do this, we put parentheses ( and 
) around the first pattern, and use the number operator \ 1 in the second 
pattern to refer back. Here’s how it looks: 


s/([0-9]+)/<\1>/ 
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The parenthesis and number operators can also be used to specify that 
a certain string or expression must occur twice in the text. For example, 
suppose we are looking for the pattern ‘the Xer they were, the Xer they will 
be’, where we want to constrain the two X’s to be the same string. We do 
this by surrounding the first X with the parenthesis operator, and replacing 
the second X with the number operator \ 1, as follows: 

/the (,*)er they were, the \ler they will be/ 

Here the \ 1 will be replaced by whatever string matched the first item in 
parentheses. So this will match The bigger they were, the bigger they will be 
but not The bigger they were, the faster they will be. 

The number operator can be used with other numbers: if you match 
two different sets of parenthesis, \2 means whatever matched the second 
set. For example 

/the (.*)er they (.*), the \ler they \2/ 

will match The bigger they were, the bigger they were but not The bigger they 
registers were, the bigger they will be. These numbered memories are called registers 
(e.g. register 1, register 2, register 3, etc). This memory feature is not part 
of every regular expression language, and is often considered an ‘extended’ 
feature of regular expressions. 

Substitutions using memory are very useful in implementing a simple 
natural-language understanding program like ELIZA (Weizenbaum, 1966). 
Recall that ELIZA simulated a Rogerian psychologist, and could carry on 
conversations with the user like the following: 

Useri: Men are all alike. 

ELIZAi: IN WHAT WAY 

User 2 : They’re always bugging us about something or other.. 

ELIZA 2 : CAN YOU THINK OF A SPECIFIC EXAMPLE 
User 3 : Well, my boyfriend made me come here. 

ELIZA 3 : YOUR BOYFRIEND MADE YOU COME HERE 
User 4 : He says I’m depressed much of the time. 

ELIZA 4 : I AM SORRY TO HEAR YOU ARE DEPRESSED. 

Eliza worked by having a cascade of regular expression substitutions 
that each matched some part of the input lines and changed them. The first 
substitutions changed all instances of my to YOUR , and Tm to YOU ARE , 
and so on. The next set of substitutions looked for relevant patterns in the 
input and created an appropriate output; here are some examples: 

s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/ 



Section 2.2. 


Finite-State Automata 


33 


s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/ 
s/.* all .*/IN WHAT WAY/ 

s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/ 

Since multiple substitutions could apply to a given input, substitutions 
were assigned a rank and were applied in order. Creation of such patterns is 
addressed in Exercise 2.2. 

2.2 Finite-State Automata 


The regular expression is more than just a convenient metalanguage for text 
searching. First, a regular expression is one way of describing a finite-state 
automaton (FSA). Finite-state automata are the theoretical foundation of automaton e 
a good deal of the computational work we will describe in this book, Any fsa 
regular expression can be implemented as a finite-state automaton (except 
regular expressions that use the memory feature; more on this later). Sym¬ 
metrically, any finite-state automaton can be described with a regular expres¬ 
sion. Second, a regular expression is one way of characterizing a particular 
kind of formal language called a regular language. Both regular expres- language 
sions and finite-state automata can be used to described regular languages. 

The relation among these three theoretical constructions is sketched out in 
Figure 2.9. 


regular 

expressions 


finite - 
automata 


regular 

languages 


Figure 2.9 The relationship between finite automata, regular expressions, 
and regular languages; figure suggested by Martin Kay. 


This section will begin by introducing finite-state automata for some of 
the regular expressions from the last section, and then suggest how the map¬ 
ping from regular expressions to automata proceeds in general. Although 
we begin with their use for implementing regular expressions, FSAs have a 
wide variety of other uses which we will explore in this chapter and the next. 
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AUTOMATON 


STATE 

START STATE 


Using an FSA to Recognize Sheeptalk 

After a while, with the parrot's help, the Doctor got to learn the 
language of the animals so well that he could talk to them himself 
and understand everything they said. 

Hugh Lofting, The Story of Doctor Dolittle 

Let’s begin with the ‘sheep language’ we discussed previously. Recall 
that we defined the sheep language as any string from the following (infinite) 
set: 


baa! 

baaa! 

baaaa! 

baaaaa! 

baaaaaa! 



The regular expression for this kind of ‘sheep talk’ is /baa+ ! /. Fig¬ 
ure 2.10 shows an automaton for modeling this regular expression. The 
automaton (i.e. machine, also called finite automaton, finite-state automa¬ 
ton, or FSA) recognizes a set of strings, in this case the strings characterizing 
sheep talk, in the same way that a regular expression does. We represent the 
automaton as a directed graph: a finite set of vertices (also called nodes), 
together with a set of directed links between pairs of vertices called arcs. 
We’ll represent vertices with circles and arcs with arrows. The automaton 
has five states, which are represented by nodes in the graph. State 0 is the 
start state which we represent by the incoming arrow. State 4 is the final 
state or accepting state, which we represent by the double circle. It also has 
four transitions, which we represent by arcs in the graph. 

The FSA can be used for recognizing (we also say accepting) strings 
in the following way. First, think of the input as being written on a long tape 
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broken up into cells, with one symbol written in each cell of the tape, as in 
Figure 2.11. 



The machine starts in the start state (qo), and iterates the following 
process: Check the next letter of the input. If it matches the symbol on 
an arc leaving the current state, then cross that arc, move to the next state, 
and also advance one symbol in the input. If we are in the accepting state 
(# 4 ) when we run out of input, the machine has successfully recognized an 
instance of sheeptalk. If the machine never gets to the final state, either 
because it runs out of input, or it gets some input that doesn’t match an arc 
(as in Figure 2.11), or if it just happens to get stuck in some non-final state, 
we say the machine rejects or fails to accept an input. 

We can also represent an automaton with a state-transition table. As 
in the graph notation, the state-transition table represents the start state, the 
accepting states, and what transitions leave each state with which symbols. 
Here’s the state-transition table for the FSA of Figure 2.10. 



Input 

State 

b 

a 

1 

0 

1 

0 

0 

1 

0 

2 

0 

2 

0 

3 

0 

3 

0 

3 

4 

4: 

0 

0 

0 


Figure 2.12: The state-transition table for the FSA of Figure 2.10 


REJECTS 

STATE- 

TRANSITION 

TABLE 


We’ve marked state 4 with a colon to indicate that it’s a final state (you 
can have as many final states as you want), and the 0 indicates an illegal or 
missing transition. We can read the first row as “if we’re in state 0 and we 
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see the input b we must go to state 1. If we’re in state 0 and we see the input 
a or !, we fail”. 

More formally, a finite automaton is defined by the following 5 param¬ 
eters: 

• Q: a finite set of N states #o, q \,... Qn 

• Z: a finite input alphabet of symbols 

• q{)\ the start state 

• F: the set of final states, F CQ 

• 8 (< 7 - /): the transition function or transition matrix between states. Given 

a state q E Q and an input symbol i E Z, 8 (qj) returns a new state 

E <2. 8 is thus a relation from Q x Z to Q; 

For the sheeptalk automaton in Figure 2.10, Q = {#o,<? 3 ,# 4 }, 

Z = {a,Z>,!}, F = {# 4 }, and 8 (#, /) is defined by the transition table in Fig¬ 
ure 2 . 12 . 

Figure 2.13 presents an algorithm for recognizing a string using a state- 
transition table. The algorithm is called d-recognize for ‘deterministic 
recognizer’. A deterministic algorithm is one that has no choice points; 
the algorithm always knows what to do for any input. The next section will 
introduce non-deterministic automata that must make decisions about which 
states to move to. 

d-recognize takes as input a tape and an automaton. It returns ac¬ 
cept if the string it is pointing to on the tape is accepted by the automaton, 
and reject otherwise. Note that since d-recognize assumes it is already 
pointing at the string to be checked, its task is only a subpart of the general 
problem that we often use regular expressions for, finding a string in a corpus 
(the general problem is left as an exercise to the reader in Exercise 2.8). 

d-recognize begins by initializing the variables index and current- 
state to the beginning of the tape and the machine’s initial state, d-recognize 
then enters a loop that drives the rest of the algorithm. It first checks whether 
it has reached the end of its input. If so, it either accepts the input (if the cur¬ 
rent state is an accept state) or rejects the input (if not). 

If there is input left on the tape, d-recognize looks at the transition 
table to decide which state to move to. The variable current-state indicates 
which row of the table to consult, while the current symbol on the tape indi¬ 
cates which column of the table to consult. The resulting transition-table cell 
is used to update the variable current-state and index is incremented to move 
forward on the tape. If the transition-table cell is empty then the machine 
has nowhere to go and must reject the input. 
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function D-RECOGmzE(tape, machine ) returns accept or reject 

indexer- Beginning of tape 
current-state f- Initial state of machine 

loop 

if End of input has been reached then 
if current-state is an accept state then 
return accept 
else 

return reject 

elsif transition-table [current-state, tape [index 77 is empty then 
return reject 
else 

current-state <- transition-table [ current-state, tape [ index]] 
index <- index + 1 

end 


Figure 2.13 An algorithm for deterministic recognition of FSAs. This al¬ 
gorithm returns accept if the entire string it is pointing at is in the language 
defined by the FSA, and reject if the string is not in the language. 


Figure 2.14 traces the execution of this algorithm on the sheep lan¬ 
guage FSA given the sample input string baaal. 






i_UL 

a a 

a ! 

3 

Figure 2.14 Tracing the execution of FSA #1 on some sheeptalk. 


Before examining the beginning of the tape, the machine is in state qo. 
Finding a b on input tape, it changes to state q\ as indicated by the contents 
of transition-table[qo,b] in Figure 2.12 on page 35. It then finds an a and 
switches to state qi* another a puts it in state # 3 , a third a leaves it in state 43 , 
where it reads the T, and switches to state # 4 . Since there is no more input, 
the End of input condition at the beginning of the loop is satisfied for 
the first time and the machine halts in cp. State cp is an accepting state, 
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and so the machine has accepted the string baaa! as a sentence in the sheep 
language. 

The algorithm will fail whenever there is no legal transition for a given 
combination of state and input. The input abc will fail to be recognized since 
there is no legal transition out of state qo on the input a, (i.e. this entry of 
the transition table in Figure 2.12 on page 35 has a 0). Even if the automaton 
had allowed an initial a it would have certainly failed on c, since c isn’t even 
in the sheeptalk alphabet!). We can think of these ‘empty’ elements in the 
table as if they all pointed at one ‘empty’ state, which we might call the fail 
fail state state or sink state. In a sense then, we could view any machine with empty 
transitions as if we had augmented it with a fail state, and drawn in all the 
extra arcs, so we always had somewhere to go from any state on any possible 
input. Just for completeness, Figure 2.15 shows the FSA from Figure 2.10 
with the fail state q f filled in. 



Formal Languages 

We can use the same graph in Figure 2.10 as an automaton for generating 
sheeptalk. If we do, we would say that the automaton starts at state qo , and 
crosses arcs to new states, printing out the symbols that label each arc it 
follows. When the automaton gets to the final state it stops. Notice that at 
state 3, the automaton has to chose between printing out a ! and going to 
state 4, or printing out an a and returning to state 3. Let’s say for now that 
we don’t care how the machine makes this decision; maybe it flips a coin. 
For now, we don’t care which exact string of sheeptalk we generate, as long 
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as it’s a string captured by the regular expression for sheep talk above. 

Key Concept #l.Formal Language: A model which can both gener¬ 
ate and recognize all and only the strings of a formal language acts as 

a definition of the formal language. 

A formal language is a set of strings, each string composed of symbols language 
from a finite symbol-set called an alphabet (the same alphabet used above alphabet 
for defining an automaton!). The alphabet for the sheep language is the set 
£ = {a, fe,!}. Given a model m (such as a particular FSA), we can use L(m) 
to mean “the formal language characterized by rtf . So the formal language 
defined by our sheeptalk automaton m in Figure 2.10 (and Figure 2.12) is the 
infinite set: 

L(m) = {baa\,baaa\,baaaa\,baaaaa\,baaaaaa\...} (2.1) 

The usefulness of an automaton for defining a language is that it can 
express an infinite set (such as this one above) in a closed form. Formal 
languages are not the same as natural languages, which are the kind of languages 
languages that real people speak. In fact a formal language may bear no re¬ 
semblance at all to a real language (for example a formal language can be 
used to model the different states of a soda machine). But we often use a 
formal language to model part of a natural language, such as parts of the 
phonology, morphology, or syntax. The term generative grammar is some¬ 
times used in linguistics to mean a grammar of a formal language; the origin 
of the term is this use of an automaton to define a language by generating all 
possible strings. 

Another Example 

In the previous examples our formal alphabet consisted of letters; but we 
can also have a higher-level alphabet consisting of words. In this way we 
can write finite-state automata that model facts about word combinations. 

For example, suppose we wanted to build an FSA that modeled the subpart 
of English dealing with amounts of money. Such a formal language would 
model the subset of English consisting of phrases like ten cents , three dol¬ 
lars , one dollar thirty-five cents and so on. 

We might break this down by first building just the automaton to ac¬ 
count for the numbers from one to ninety-nine, since we’ll need them to deal 
with cents. Figure 2.16 shows this. 
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one 

six 

ten 

sixty 

eleven 

sixteen 


two 

seven 

twenty 

seventy 

twelve 

seventeen 


three 

eight 

thirty 

eighty 

thirteen 

eighteen 


four 

nine 

forty 

ninety 

fourteen 

nineteen 


five 


fifty 


fifteen 





sixty 
seventy 1 


^^one 



jT 1 

twenty 

thirty 

fa ) 

1 two 

seven {/ 


xy 


' three 

eight VV 

forty 

eighty 


four 

• v\ 

nine \ 


fifty 

ninety 


five 


Figure 2.16 

An FSA for the words for English numbers 1-99. 



We could now add cents and dollars to our automaton. Figure 2.17 
shows a simple version of this, where we just made two copies of the au¬ 
tomaton in Figure 2.16 and appended the words cents and dollars. 


one six ten sixty eleven sixteen 

two seven twenty seventy twelve seventeen 

three eight thirty eighty thirteen eighteen 

four nine forty ninety fourteen nineteen 

five fifty fifteen 


one six ten sixty eleven sixteen 

two seven twenty seventy twelve seventeen 

three eight thirty eighty thirteen eighteen 

four nine forty ninety fourteen nineteen 

five fifty fifteen 



Figure 2.17 FSA for the simple dollars and cents. 


We would now need to add in the grammar for different amounts of 
dollars; including higher numbers like hundred , thousand. We’d also need to 
make sure that the nouns like cents and dollars are singular when appropriate 
(one cent , one dollar ), and plural when appropriate (ten cents , two dollars). 
This is left as an exercise for the reader (Exercise 2.3). We can think of the 
FSAs in Figure 2.16 and Figure 2.17 as simple grammars of parts of English. 
We will return to grammar-building in Part II of this book, particularly in 
Chapter 9. 


Nondeterministic FSAs 

deterministic Let’s extend our discussion now to another class of FSAs: non-deterministic 
FSAs (or NFS As). Consider the sheeptalk automaton in Figure 2.18, which 
is much like our first automaton in Figure 2.10: 


NFSA 
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Figure 2.18 A non-determiniStic finite-state automaton for talking sheep 
(NFSA #1). Compare with the deterministic automaton in Figure 2.10. 


The only difference between this automaton and the previous one is 
that here in Figure 2.18 the self-loop is on state 2 instead of state 3. Con¬ 
sider using this network as an automaton for recognizing sheeptalk. When 
we get to state 2, if we see an a we don’t know whether to remain in state 
2 or go on to state 3. Automata with decision points like this are called 
non-deterministic FSAs (or NFSAs). Recall by contrast that Figure 2.10 
specified a deterministic automaton, i.e. one whose behavior during recog¬ 
nition is fully determined by the state it is in and the symbol it is looking at. 
A deterministic automaton can be referred to as a DFSA. That is not true for 
the machine in Figure 2.18 (NFSA #1). 

There is another common type of non-determinism, which can be caused 
by arcs that have no symbols on them (called £- transitions). The automaton 
in Figure 2.19 defines the exact same language as the last one, or our first 
one, but it does it with an £-transition. 



We interpret this new arc as follows: if we are in state 3, we are al¬ 
lowed to move to state 2 without looking at the input, or advancing our input 
pointer. So this introduces another kind of non-determinism - we might not 
know whether to follow the £-transition or the ! arc. 


NON- 

DETERMINISTIC 

FSA 

NFSA 


DFSA 


£ 

TRANSITIONS 



42 


Chapter 2. Regular Expressions and Automata 


Using an NFSA to accept strings 

If we want to know whether a string is an instance of sheeptalk or not, and 
if we use a non-deterministic machine to recognize it, we might follow the 
wrong arc and reject it when we should have accepted it. That is, since there 
is more than one choice at some point, we might take the wrong choice. This 
problem of choice in non-deterministic models will come up again and again 
as we build computational models, particularly for parsing. 

There are three standard solutions to this problem: 

• Backup: Whenever we come to a choice point, we could put a marker 
to mark where we were in the input, and what state the automaton was 
in. Then if it turns out that we took the wrong choice, we could back 
up and try another path. 

• Look-ahead: We could look ahead in the input to help us decide which 
path to take. 

• Parallelism: Whenever we come to a choice point, we could look at 
every alternative path in parallel. 

We will focus here on the backup approach and defer discussion of the 
look-ahead and parallelism approaches to later chapters. 

The backup approach suggests that we should blithely make choices 
that might lead to deadends, knowing that we can always return to unex¬ 
plored alternative choices. There are two keys to this approach: we need 
to remember all the alternatives for each choice point, and we need to store 
sufficient information about each alternative so that we can return to it when 
necessary. When a backup algorithm reaches a point in its processing where 
no progress can be made (because it runs out of input, or has no legal tran¬ 
sitions), it returns to a previous choice point, selects one of the unexplored 
alternatives, and continues from there. Applying this notion to our non- 
deterministic recognizer, we need only remember two things for each choice 
point: the state, or node, of the machine that we can go to and the corre¬ 
sponding position on the tape. We will call the combination of the node and 
position the search-state of the recognition algorithm. To avoid confusion, 
we will refer to the state of the automaton (as opposed to the state of the 
search) as a node or a machine-state. Figure 2.21 presents a recognition 
algorithm based on this approach. 

Before going on to describe the main part of this algorithm, we should 
note two changes to the transition table that drives it. First, in order to rep¬ 
resent nodes that have outgoing £-transitions, we add a new £-column to the 
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Input 

State 

b 

a 

1 

8 

0 

1 

0 

0 

0 

1 

0 

2 

0 

0 

2 

0 

2,3 

0 

0 

3 

0 

0 

4 

0 

4: 

0 

0 

0 

0 


Figure 2.20: The transition table from NFSA #1 in Figure 2.18. 

transition table. If a node has an 8 -transition, we list the destination node in 
the e-column for that node’s row. The second addition is needed to account 
for multiple transitions to different nodes from the same input symbol. We 
let each cell entry consist of a list of destination nodes rather than a single 
node. Figure 2.20 shows the transition table for the machine in Figure 2.18 
(NFSA #1). While it has no 8 -transitions, it does show that in machine-state 
#2 the input a can lead back to #2 or on to q 3 . 

Figure 2.21 shows the algorithm for using a non-deterministic FSA 
to recognize an input string. The function nd-recognize uses the variable 
agenda to keep track of all the currently unexplored choices generated during 
the course of processing. Each choice (search state) is a tuple consisting of a 
node (state) of the machine and a position on the tape. The variable current- 
search-state represents the branch choice being currently explored. 

nd-recognize begins by creating an initial search-state and placing 
it on the agenda. For now we don’t specify what order the search-states are 
placed on the agenda. This search-state consists of the initial machine-state 
of the machine and a pointer to the beginning of the tape. The function NEXT 
is then called to retrieve an item from the agenda and assign it to the variable 
current-search-state . 

As with D-RECOGNIZE, the first task of the main loop is to determine 
if the entire contents of the tape have been successfully recognized. This 
is done via a call to accept-STATE?, which returns accept if the current 
search-state contains both an accepting machine-state and a pointer to the 
end of the tape. If we’re not done, the machine generates a set of possible 
next steps by calling generate-new-STATES, which creates search-states 
for any 8 -transitions and any normal input-symbol transitions from the tran¬ 
sition table. All of these search-state tuples are then added to the current 
agenda. 

Finally, we attempt to get a new search-state to process from the agenda. 
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If the agenda is empty we’ve run out of options and have to reject the input. 
Otherwise, an unexplored option is selected and the loop continues. 

It is important to understand why nd-recognize returns a value of 
reject only when the agenda is found to be empty. Unlike d-recognize, it 
does not return reject when it reaches the end of the tape in an non-accept 
machine-state or when it finds itself unable to advance the tape from some 
machine-state. This is because, in the non-deterministic case, such road¬ 
blocks only indicate failure down a given path, not overall failure. We can 
only be sure we can reject a string when all possible choices have been ex¬ 
amined and found lacking. 

Figure 2.22 illustrates the progress of nd-recognize as it attempts to 
handle the input baa a !. Each strip illustrates the state of the algorithm at 
a given point in its processing. The current-search-state variable is captured 
by the solid bubbles representing the machine-state along with the arrow rep¬ 
resenting progress on the tape. Each strip lower down in the figure represents 
progress from one current-search-state to the next. 

Little of interest happens until the algorithm finds itself in state q 2 
while looking at the second a on the tape. An examination of the entry 
for transition-table[# 2 ,a] returns both q 2 and # 3 . Search states are created 
for each of these choices and placed on the agenda. Unfortunately, our al¬ 
gorithm chooses to move to state # 3 , a move that results in neither an accept 
state nor any new states since the entry for transition-table [ 43 , a] is empty. 
At this point, the algorithm simply asks the agenda for a new state to pursue. 
Since the choice of returning to q 2 from q 2 is the only unexamined choice on 
the agenda it is returned with the tape pointer advanced to the next a. Some¬ 
what diabolically, nd-recognize finds itself faced with the same choice. 
The entry for transition-table \q 2 ,a] still indicates that looping back to q 2 or 
advancing to q 3 are valid choices. As before, states representing both are 
placed on the agenda. These search states are not the same as the previous 
ones since their tape index values have advanced. This time the agenda pro¬ 
vides the move to q 3 as the next move. The move to # 4 , and success, is then 
uniquely determined by the tape and the transition-table. 

Recognition as Search 

nd-recognize accomplishes the task of recognizing strings in a regular 
language by providing a way to systematically explore all the possible paths 
through a machine. If this exploration yields a path ending in an accept 
state, it accepts the string, otherwise it rejects it. This systematic exploration 



Section 2.2. 


Finite-State Automata 


45 


function ND-RECOGNiZEtepe, machine) returns accept or reject 

agendas- {(Initial state of machine, beginning of tape)} 
current-search-state *— N ext (agenda) 

loop 

if Accept-State l(current-search-state) returns true then 
return accept 
else 

agenda*-agenda U Generate-New-States (current-search-state) 
if agenda is empty then 
return reject 
else 

current-search-state *— N ext (agenda) 

end 

function GENERATE-NEw-STATES(current-state) returns a set of search- 
states 

current-node <r- the node the current search-state is in 
index*- the point on the tape the current search-state is looking at 
return a list of search states from transition table as follows: 
(transition-table[current-node, e], index) 

U 

(transition-table [current-node, tape [index]], index + 7) 
function Accept-State l(search-state) returns true or false 

current-node ir- the node search-state is in 

indexer- the point on the tape search-state is looking at 

if index is at the end of the tape and current-node is an accept state of machine 

then 

return true 
else 

return false 


Figure 2.21 An algorithm for NFS A recognition. The word node means 
a state of the FSA, while state or search-state means ‘the state of the search 
process’, i.e. a combination of node and tape-position 


is made possible by the agenda mechanism, which on each iteration selects a 
partial path to explore and keeps track of any remaining, as yet unexplored, 
partial paths. 

Algorithms such as ND-RECOGNIZE, which operate by systematically 
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searching for solutions, are known as state-space search algorithms. In 
such algorithms, the problem definition creates a space of possible solu¬ 
tions; the goal is to explore this space, returning an answer when one is 
found or rejecting the input when the space has been exhaustively explored. 
In nd-recognize, search states consist of pairings of machine-states with 
positions on the input tape. The state-space consists of all the pairings of 
machine-state and tape positions that are possible given the machine in ques¬ 
tion. The goal of the search is to navigate through this space from one state to 
another looking for a pairing of an accept state with an end of tape position. 

The key to the effectiveness of such programs is often the order in 
which the states in the space are considered. A poor ordering of states may 
lead to the examination of a large number of unfruitful states before a suc¬ 
cessful solution is discovered. Unfortunately, it is typically not possible to 
tell a good choice from a bad one, and often the best we can do is to insure 
that each possible solution is eventually considered. 
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Careful readers may have noticed that the ordering of states in nd- 
recognize has been left unspecified. We know only that unexplored states 
are added to the agenda as they are created and that the (undefined) func¬ 
tion Next returns an unexplored state from the agenda when asked. How 
should the function Next be defined? Consider an ordering strategy where 
the states that are considered next are the most recently created ones. Such 
a policy can be implemented by placing newly created states at the front 
of the agenda and having Next return the state at the front of the agenda 
when called. Thus the agenda is implemented by a stack. This is commonly 
referred to as a depth-first search or Last In First Out (LIFO) strategy. 

Such a strategy dives into the search space following newly developed 
leads as they are generated. It will only return to consider earlier options 
when progress along a current lead has been blocked. The trace of the ex¬ 
ecution of ND-RECOGNIZE on the string baaa ! as shown in Figure 2.22 
illustrates a depth-first search. The algorithm hits the first choice point after 
seeing ba when it has to decide whether to stay in q 2 or advance to state 
# 3 . At this point, it chooses one alternative and follows it until it is sure it’s 
wrong. The algorithm then backs up and tries another older alternative. 

Depth first strategies have one major pitfall: under certain circum¬ 
stances they can enter an infinite loop. This is possible either if the search 
space happens to be set up in such a way that a search-state can be acciden¬ 
tally re-visited, or if there are an infinite number of search states. We will 
revisit this question when we turn to more complicated search problems in 
parsing in Chapter 10. 

The second way to order the states in the search space is to consider 
states in the order in which they are created. Such a policy can be imple¬ 
mented by placing newly created states at the back of the agenda and still 
have Next return the state at the front of the agenda. Thus the agenda is 
implemented via a queue. This is commonly referred to as a breadth-first 
search or First In First Out (FIFO) strategy. Consider a different trace 
of the execution of nd-recognize on the string baaa ! as shown in Fig¬ 
ure 2.23. Again, the algorithm hits its first choice point after seeing ba when 
it had to decide whether to stay in q 2 or advance to state 43 . But now rather 
than picking one choice and following it up, we imagine examining all pos¬ 
sible choices, expanding one ply of the search tree at a time. 

Like depth-first search, breadth-first search has its pitfalls. As with 
depth-first if the state-space is infinite, the search may never terminate. More 
importantly, due to growth in the size of the agenda if the state-space is 
even moderately large, the search may require an unpractically large amount 
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of memory. For small problems, either depth-first or breadth-first search 
strategies may be adequate, although depth-first is normally preferred for its 
more efficient use of memory. For larger problems, more complex search 
techniques such as dynamic programming or A* must be used, as we will 
see in Chapter 7 and Chapter 10. 

Relating Deterministic and Non-deterministic Automata 

It may seem that allowing NFSAs to have non-deterministic features like e- 
transitions would make them more powerful than DFSAs. In fact this is not 
the case; for any NFSA, there is an exactly equivalent DFSA. In fact there is 
a simple algorithm for converting an NFSA to an equivalent DFSA, although 
the number of states in this equivalent deterministic automaton may be much 
larger. See Lewis and Papadimitriou (1981) or Hopcroft and Ullman (1979) 
for the proof of the correspondence. The basic intuition of the proof is worth 
mentioning, however, and builds on the way NFSAs parse their input. Recall 
that the difference between NFSAs and DFSAs is that in an NFSA a state q- } 
may have more than one possible next state given an input i (for example q a 
and qb). The algorithm in Figure 2.21 dealt with this problem by choosing 
either q a or qy and then backtracking if the choice turned out to be wrong. 
We mentioned that a parallel version of the algorithm would follow both 
paths (toward q a and qy) simultaneously. 
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The algorithm for converting a NFSA to a DFSA is like this parallel 
algorithm; we build an automaton that has a deterministic path for every path 
our parallel recognizer might have followed in the search space. We imagine 
following both paths simultaneously, and group together into an equivalence 
class all the states we reach on the same input symbol (i.e. q a and qb). We 
now give a new state label to this new equivalence class state (for example 
q a b). We continue doing this for every possible input for every possible group 
of states. The resulting DFSA can have as many states as there are distinct 
sets of states in the original NFSA. The number of different subsets of a set 
with N elements is 2 N , hence the new DFSA can have as many as 1 N states. 


2.3 Regular Languages and FSAs 

As we suggested above, the class of languages that are definable by regular 
expressions is exactly the same as the class of languages that are character- 
izable by finite-state automata (whether deterministic or non-deterministic). 

Because of this, we call these languages the regular languages. In order to languages 

give a formal definition of the class of regular languages, we need to refer 

back to two earlier concepts: the alphabet X, which is the set of all symbols in 

the language, and the empty string £, which is conventionally not included in 

X. In addition, we make reference to the empty set 0 (which is distinct from 

8). The class of regular languages (or regular sets) over X is then formally 

as follows: 1 

1. 0 is a regular language 

2. Va G X U £, {a} is a regular language 

3. If Li and L 2 are regular languages, then so are: 

(a) L\ - La = {xy\x 6 L\,y e L 2 }, the concatenation of L\ andL 2 

(b) L\ UL 2 , the union or disjunction of L\ andL 2 

(c) L*, the Kleene closure of L\ 

All and only the sets of languages which meet the above properties 
are regular languages. Since the regular languages are the set of languages 
characterizable by regular expressions, it must be the case that all the regu¬ 
lar expression operators introduced in this chapter (except memory) can be 
implemented by the three operations which define regular languages: con- 

L Following van Santen and Sproat (1998), Kaplan and Kay (1994), and Lewis and Pa- 
padimitriou (1981). 
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catenation, disjunction/union (also called ‘|’), and Kleene closure. For ex¬ 
ample all the counters (*,+, {n, m}) are just a special case of repetition plus 
Kleene *. All the anchors can be thought of as individual special symbols. 
The square braces [ ] are a kind of disjunction (i.e. [ ab ] means “a or or 
the disjunction of a and b). Thus it is true that any regular expression can be 
turned into a (perhaps larger) expression which only makes use of the three 
primitive operations. 

Regular languages are also closed under the following operations (where 
E* means the infinite set of all possible strings formed from the alphabet E): 

• intersection: if L\ and L 2 are regular languages, then so is L\ n L 2 , the 
language consisting of the set of strings that are in both L\ and L 2 . 

• difference: if L\ and L 2 are regular languages, then so is L\ — L 2 , the 
language consisting of the set of strings that are in L\ but not L 2 . 

• complementation: If L\ is a regular language, then so is E* — Li, the 
set of all possible strings that aren’t in L\ 

• reversal: If L\ is a regular language, then so is L*, the language con¬ 
sisting of the set of reversals of all the strings in L \. 

The proof that regular expressions are equivalent to finite-state au¬ 
tomata can be found in Hopcroft and Ullman (1979), and has two parts: 
showing that an automaton can be built for each regular language, and con¬ 
versely that a regular language can be built for each automaton. We won’t 
give the proof, but we give the intuition by showing how to do the first part: 
take any regular expression and build an automaton from it. The intuition is 
inductive: for the base case we build an automaton to correspond to regular 
expressions of a single symbol (e.g. the expression a) by creating an initial 
state and an accepting final state, with an arc between them labeled a. For 
the inductive step, we show that each of the primitive operations of a regular 
expression (concatenation, union, closure) can be imitated by an automaton: 

• concatenation: We just string two FSAs next to each other by con¬ 
necting all the final states of FSAi to the initial state of FSA 2 by an 
8-transition. 

• closure: We connect all the final states of the FSA back to the initial 
states by 8-transitions (this implements the repetition part of the Kleene 
*), and then put direct links between the initial and final states by 8- 
transitions (this implements the possibly of having zero occurrences). 
We’d leave out this last part to implement Kleene-plus instead. 

• union: We add a single new initial state q f 0 , and add new transitions 
from it to all the former initial states of the two machines to be joined. 
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2.4 Summary 

This chapter introduced the most important fundamental concept in language 
processing, the finite automaton, and the practical tool based on automaton, 
the regular expression. Here’s a summary of the main points we covered 
about these ideas: 

• the regular expression language is a powerful tool for pattern-matching. 
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• basic operations in regular expressions include concatenation of sym¬ 
bols, disjunction of symbols ([], |, and .)> counters (*, +, and 
{n , m}), anchors (~, $) and precedence operators ((,)). 

• any regular expression can be realized as a finite automaton. 

• memory ( \ 1 together with ()) is an advanced operation which is often 
considered part of regular expressions, but which cannot be realized as 
a finite automaton. 

• an automaton implicitly defines a formal language as the set of strings 
the automaton accepts. 

• an automaton can use any set of symbols for its vocabulary, including 
letters, words, or even graphic images. 

• the behavior of a deterministic automata (DFSA) is fully determined 
by the state it is in. 

• a non-deterministic (NFSA) automata sometimes has to make a choice 
between multiple paths to take given the same current state and next in¬ 
put. 

• any NFSA can be converted to a DFSA. 

• the order in which a NFSA chooses the next state to explore on the 
agenda defines its search strategy. The depth-first search or LIFO 
strategy corresponds to the agenda-as-stack; the breadth-first search 
or FIFO strategy corresponds to the agenda-as-queue. 

• any regular expression can be automatically compiled into a NFSA and 
hence into a FSA. 


Bibliographical and Historical Notes 

Finite automata arose in the 1950’s out of Turing’s (1936) model of algo¬ 
rithmic computation, considered by many to be the foundation of modem 
computer science. The Turing machine was an abstract machine with a finite 
control and an input/output tape. In one move, the Turing machine could 
read a symbol on the tape, write a different symbol on the tape, change state, 
and move left or right. (Thus the Turing machine differs from a finite-state 
automaton mainly in its ability to change the symbols on its tape). 

Inspired by Turing’s work, McCulloch and Pitts built an automata-like 
model of the neuron (see von Neumann, 1963, p. 319). Their model, which 
is now usually called the McCulloch-Pitts neuron (McCulloch and Pitts, 
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1943), was a simplified model of the neuron as a kind of ‘computing ele¬ 
ment’ that could be described in terms of propositional logic. The model 
was a binary device, at any point either active or not, which took excitatory 
and inhibitatory input from other neurons and fired if its activation passed 
some fixed threshold. Based on the McCulloch-Pitts neuron, Kleene (1951) 
and (1956) defined the finite automaton and regular expressions, and proved 
their equivalence. Non-deterministic automata were introduced by Rabin 
and Scott (1959), who also proved them equivalent to deterministic ones. 

Ken Thompson was one of the first to build regular expressions compil¬ 
ers into editors for text searching (Thompson, 1968). His editor ed included 
a command “g/regular expression/p”, or Global Regular Expression Print, 
which later became the UNIX grep utility. 

There are many general-purpose introductions to the mathematics un¬ 
derlying automata theory; such as Hopcroft and Ullman (1979) and Lewis 
and Papadimitriou (1981). These cover the mathematical foundations the 
simple automata of this chapter, as well as the finite-state transducers of 
Chapter 3, the context-free grammars of Chapter 9, and the Chomsky hier¬ 
archy of Chapter 13. Friedl (1997) is a very useful comprehensive guide to 
the advanced use of regular expressions. 

The metaphor of problem-solving as search is basic to Artificial Intel¬ 
ligence (AI); more details on search can be found in any AI textbook such as 
Russell and Norvig (1995). 


Exercises 

2.1 Write regular expressions for the following languages: You may use 
either Perl notation or the minimal ‘algebraic’ notation of Section 2.3, but 
make sure to say which one you are using. By ‘word’, we mean an alphabetic 
string separated from other words by white space, any relevant punctuation, 
line breaks, etc. 

a. the set of all alphabetic strings. 

b. the set of all lowercase alphabetic strings ending in a b. 

c. the set of all strings with two consecutive repeated words (for example 
‘Humbert Humbert’ and ‘the the’ but not ‘the bug’ or ‘the big bug’). 
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d. the set of all strings from the alphabet a, b such that each a is immedi¬ 
ately preceded and immediately followed by a b. 

e. all strings which start at the beginning of the line with an integer (i.e. 
1,2,3...10...10000...) and which end at the end of the line with a word. 

f. all strings which have both the word grotto and the word raven in them, 
(but not, for example, words like grottos that merely contain the word 
grotto). 

g. write a pattern which places the first word of an English sentence in a 
register. Deal with punctuation. 

2.2 Implement an ELIZA-like program, using substitutions such as those 
described on page 32. You may choose a different domain than a Rogerian 
psychologist, if you wish, although keep in mind that you would need a 
domain in which your program can legitimately do a lot of simple repeating- 
back. 

2.3 Complete the FSA for English money expressions in Figure 2.16 as 
suggested in the text following the figure. You should handle amounts up 
to $100,000, and make sure that “cent” and “dollar” have the proper plural 
endings when appropriate. 

2.4 Design an FSA that recognizes simple date expressions like March 75, 
the 22nd of November, Christmas. You should try to include all such ‘ab¬ 
solute’ dates, (e.g. not ‘deictic’ ones relative to the current day like the day 
before yesterday). Each edge of the graph should have a word or a set of 
words on it. You should use some sort of shorthand for classes of words to 
avoid drawing too many arcs (e.g. Furniture desk, chair, table) 

2.5 Now extend your date FSA to handle deictic expressions like yesterday, 
tomorrow, a week from tomorrow, the day before yesterday, Sunday, next 
Monday, three weeks from Saturday. 

2.6 Write an FSA for time-of-day expressions like eleven o ’clock, twelve- 
thirty, midnight, or a quarter to ten and others. 

2.7 Write a regular expression for the language accepted by the NFSA in 
Figure 2.27 

2.8 Currently the function D-RECOGNIZE in Figure 2.13 only solves a sub¬ 
part of the important problem of finding a string in some text. Extend the 
algorithm to solve the following two deficiencies: (1) d-recognize cur¬ 
rently assumes that it is already pointing at the string to be checked. (2) 



Section 2.4. Summary 


55 



D-RECOGNIZE fails if the string it is pointing includes as a proper substring 
a legal string for the FSA. That is, d-recognize fails if there is an extra 
character at the end of the string. 

2.9 Give an algorithm for negating a deterministic FSA. The negation of an 
FSA accepts exactly the set of strings that the original FSA rejects (over the 
same alphabet), and rejects all the strings that the original FSA accepts. 

2.10 Why doesn’t your previous algorithm work with NFSAs? Now extend 
your algorithm to negate an NFSA. 
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A writer is someone who writes , and a stinger is something that 
stings. But fingers don’t fing, grocers don’t groce, haberdash¬ 
ers don’t haberdash, hammers don’t ham , and humdingers don’t 
humding. 

Richard Lederer, Crazy English 


Chapter 2 introduced the regular expression, showing for example how 
a single search string could help a web search engine find both woodchuck 
and woodchucks. Hunting for singular or plural woodchucks was easy; the 
plural just tacks an s on to the end. But suppose we were looking for another 
fascinating woodland creatures; let’s say a fox, and a fish , that surly peccary 
and perhaps a Canadian wild goose. Hunting for the plurals of these animals 
takes more than just tacking on an s. The plural of fox is foxes; of peccary , 
peccaries ; and of goose , geese. To confuse matters further, fish don’t usually 
change their form when they are plural (as Dr. Seuss points out: one fish two 
fish , red fish, blue fish). 

It takes two kinds of knowledge to correctly search for singulars and 
plurals of these forms. Spelling rules tell us that English words ending in -y 
are pluralized by changing the -y to -i- and adding an -es. Morphological 
rules tell us that fish has a null plural, and that the plural of goose is formed 
by changing the vowel. 

The problem of recognizing that foxes breaks down into the two mor¬ 
phemes fox and -es is called morphological parsing. 

Key Concept #2. Parsing means taking an input and producing some parsing 

sort of structure for it. 


We will use the term parsing very broadly throughout this book, including 
many kinds of structures that might be produced; morphological, syntactic, 
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semantic, pragmatic; in the form of a string, or a tree, or a network. In 
the information retrieval domain, the similar (but not identical) problem of 
mapping from foxes to fox is called stemming. Morphological parsing or 
stemming applies to many affixes other than plurals; for example we might 
need to take any English verb form ending in -ing {going, talking , congrat¬ 
ulating) and parse it into its verbal stem plus the -ing morpheme. So given 
the surface or input form going , we might want to produce the parsed form 
VERB-go + GERUND-ing. This chapter will survey the kinds of mor¬ 
phological knowledge that needs to be represented in different languages and 
introduce the main component of an important algorithm for morphological 
parsing: the finite-state transducer. 

Why don’t we just list all the plural forms of English nouns, and all the 
-ing forms of English verbs in the dictionary? The major reason is that -ing 
is a productive suffix; by this we mean that it applies to every verb. Simi¬ 
larly -s applies to almost every noun. So the idea of listing every noun and 
verb can be quite inefficient. Furthermore, productive suffixes even apply to 
new words (so the new word fax automatically can be used in the -ing form: 
faxing). Since new words (particularly acronyms and proper nouns) are cre¬ 
ated every day, the class of nouns in English increases constantly, and we 
need to be able to add the plural morpheme -s to each of these. Additionally, 
the plural form of these new nouns depends on the spelling/pronunciation of 
the singular form; for example if the noun ends in -z then the plural form is 
-es rather than -s. We’ll need to encode these rules somewhere. Finally, we 
certainly cannot list all the morphological variants of every word in morpho¬ 
logically complex languages like Turkish, which has words like the follow¬ 
ing: 

(3.1) uygarla§tiramadiklarimizdanmig smizcasma 

uygar +la§ +tir +ama +dik +lar +inuz 

civilized -KB EC +CAUS +NEGABLE +PPART +PL +PlPL 

+dan +mis +siniz +casina 
+ABL +PAST +2PL +Aslf 

‘(behaving) as if you are among those whom we could not 
civilize/cause to become civilized’ 

The various pieces of this word (the morphemes) have these meanings: 

+bec is ‘become’ in English 

+caus is the causative voice marker on a verb 

+NegAble is ‘notable’ in English 
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+PPart marks a past participle form 

+p 1 pl is 1 st person pi possessive agreement 

+2pl is 2nd person pl 

+abl is the ablative (from/among) case marker 

+Aslf is a derivational marker that forms an adverb from a finite verb form 

In such languages we clearly need to parse the input since it is impos¬ 
sible to store every possible word. Kemal Oflazer (p.c.), who came up with 
this example, notes that verbs in Turkish have 40,000 forms not counting 
derivational suffixes; adding derivational suffixes allows a theoretically in¬ 
finite number of words. This is true because for example any verb can be 
‘causativized’ like the example above, and multiple instances of causativiza- 
tion can be embedded in a single word (you cause X to cause Y to .... do W). 
Not all Turkish words look like this; Oflazer finds that the average Turkish 
word has about three morphemes (a root plus two suffixes). Even so, the 
fact that such words are possible means that it will be difficult to store all 
possible Turkish words in advance. 

Morphological parsing is necessary for more than just information re¬ 
trieval. We will need it in machine translation to realize that the French 
words va and aller should both translate to forms of the English verb go. 
We will also need it in spell checking; as we will see, it is morphological 
knowledge that will tell us that misclam and antiundoggingly are not words. 

The next sections will summarize morphological facts about English 
and then introduce the finite-state transducer. 

3.1 Survey of (Mostly) English Morphology 

Morphology is the study of the way words are built up from smaller meaning¬ 
bearing units, morphemes. A morpheme is often defined as the minimal 
meaning-bearing unit in a language. So for example the word fox consists of 
a single morpheme (the morpheme fox ) while the word cats consists of two: 
the morpheme cat and the morpheme -s. 

As this example suggests, it is often useful to distinguish two broad 
classes of morphemes: stems and affixes. The exact details of the distinc¬ 
tion vary from language to language, but intuitively, the stem is the ‘main’ 
morpheme of the word, supplying the main meaning, while the affixes add 
‘additional’ meanings of various kinds. 

Affixes are further divided into prefixes, suffixes, infixes, and circum- 
fixes. Prefixes precede the stem, suffixes follow the stem, circumfixes do 
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both, and infixes are inserted inside the stem. For example, the word eats is 
composed of a stem eat and the suffix -s. The word unbuckle is composed of 
a stem buckle and the prefix un-. English doesn’t have any good examples 
of circumfixes, but many other languages do. In German, for example, the 
past participle of some verbs formed by adding ge- to the beginning of the 
stem and -t to the end; so the past participle of the verb sagen (to say) is 
gesagt (said). Infixes, in which a morpheme is inserted in the middle of a 
word, occur very commonly for example in the Philipine language Tagalog. 
For example the affix wm, which marks the agent of an action, is infixed to 
the Tagalog stem hingi ‘borrow’ to produce humingi. There is one infix that 
occurs in some dialects of English in which the taboo morpheme ‘f**king’ 
or others like it are inserted in the middle of other words (‘Man-f**king- 
hattan’) (McCawley, 1978). 

Prefixes and suffixes are often called concatenative morphology since 
a word is composed of a number of morphemes concatenated together. A 
number of languages have extensive non-concatenative morphology, in 
which morphemes are combined in more complex ways. The Tagalog in¬ 
fixation example above is one example of non-concatenative morphology, 
since two morphemes (hingi and um) are intermingled. Another kind of 
non-concatenative morphology is called templatic morphology or root- 
and-pattem morphology. This is very common in Arabic, Hebrew, and 
other Semitic languages. In Hebrew, for example, a verb is constructed us¬ 
ing two components: a root, consisting usually of three consonants (CCC) 
and carrying the basic meaning, and a template, which gives the ordering of 
consonants and vowels and specifies more semantic information about the 
resulting verb, such as the semantic voice (e.g. active, passive, middle). For 
example the Hebrew tri-consonantal root Imd, meaning ‘learn’ or ‘study’, 
can be combined with the active voice CaCaC template to produce the word 
lamad , ‘he studied’, or the intensive CiCeC template to produce the word 
limed , ‘he taught’, or the intensive passive template CuCaC to produce the 
word lumad , ‘he was taught’. 

A word can have more than one affix. For example, the word rewrites 
has the prefix re -, the stem write , and the suffix -s. The word unbelievably 
has a stem (believe) plus three affixes ( un -, - able , and -ly). While English 
doesn’t tend to stack more than 4 or 5 affixes, languages like Turkish can 
have words with 9 or 10 affixes, as we saw above. Languages that tend to 
string affixes together like Turkish does are called agglutinative languages. 

There are two broad (and partially overlapping) classes of ways to form 
inflection words from morphemes: inflection and derivation. Inflection is the combi- 
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nation of a word stem with a grammatical morpheme, usually resulting in a 
word of the same class as the original stem, and usually filling some syntac¬ 
tic function like agreement. For example, English has the inflectional mor¬ 
pheme -s for marking the plural on nouns, and the inflectional morpheme 
-ed for marking the past tense on verbs. Derivation is the combination of a 
word stem with a grammatical morpheme, usually resulting in a word of a 
different class, often with a meaning hard to predict exactly. For example the 
verb computerize can take the derivational suffix -ation to produce the noun 
computerization . 

Inflectional Morphology 

English has a relatively simple inflectional system; only nouns, verbs, and 
sometimes adjectives can be inflected, and the number of possible inflec¬ 
tional affixes is quite small. 

English nouns have only two kinds of inflection: an affix that marks 
plural and an affix that marks possessive. For example, many (but not all) 
English nouns can either appear in the bare stem or singular form, or take a 
plural suffix. Here are examples of the regular plural suffix -s, the alternative 
spelling - es , and irregular plurals: 



Regular Nouns 

Irregular Nouns 

Singular 

cat 

thrush 

mouse 

ox 

Plural 

cats 

thrushes 

mice 

oxen 


While the regular plural is spelled -s after most nouns, it is spelled -es 
after words ending in -s {ibis/ibises ), -z, (waltz/waltzes) -sir (thrush/thrushes) 
-ch, (finch/finches) and sometimes -x (box/boxes). Nouns ending in -y pre¬ 
ceded by a consonant change the -y to -i ( butterfly/butterflies ). 

The possessive suffix is realized by apostrophe + -s for regular singular 
nouns ( llama's ) and plural nouns not ending in -s (children's) and often by a 
lone apostrophe after regular plural nouns ( llamas' ) and some names ending 
in -y or -z (Euripides' comedies). 

English verbal inflection is more complicated than nominal inflection. 
First, English has three kinds of verbs; main verbs, (eat, sleep, impeach ), 
modal verbs (can, will, should ), and primary verbs (be, have, do) (using 
the terms of Quirk et al. , 1985a). In this chapter we will mostly be concerned 
with the main and primary verbs, because it is these that have inflectional 
endings. Of these verbs a large class are regular, that is to say all verbs of 


PLURAL 

SINGULAR 


REGULAR 
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this class have the same endings marking the same functions. These regular 
verbs (e.g. walk , or inspect ), have four morphological forms, as follow: 


Morphological Form Classes 

Regularly Inflected Verbs 

stem 

-s form 
-ing participle 

Past form or -ed participle 

walk 

walks 

walking 

walked 

merge 

merges 

merging 

merged 

try 

tries 

trying 

tried 

map 

maps 

mapping 

mapped 


These verbs are called regular because just by knowing the stem we 
can predict the other forms, by adding one of three predictable endings, and 
making some regular spelling changes (and as we will see in Chapter 4, reg¬ 
ular pronunciation changes). These regular verbs and forms are significant in 
the morphology of English first because they cover a majority of the verbs, 
and second because the regular class is productive. As discussed earlier, a 
productive class is one that automatically includes any new words that enter 
the language. For example the recently-created verb fax {My mom faxed me 
the note from cousin Everett ), takes the regular endings - ed , - ing , -es. (Note 
that the -s form is spelled faxes rather than faxs; we will discuss spelling 
rules below). 

ver e b g s ular The irregular verbs are those that have some more or less idiosyn¬ 

cratic forms of inflection. Irregular verbs in English often have five different 
forms, but can have as many as eight (e.g. the verb be) or as few as three (e.g. 
cut or hit). While constituting a much smaller class of verbs (Quirk et al. 
(1985a) estimate there are only about 250 irregular verbs, not counting aux¬ 
iliaries), this class includes most of the very frequent verbs of the language. 1 
The table below shows some sample irregular forms. Note that an irregular 

preterite verb can inflect in the past form (also called the preterite) by changing its 
vowel ( eat/ate ), or its vowel and some consonants 0 catch/caught ), or with no 
ending at all {cut/cut). 


1 In general, the more frequent a word form, the more likely it is to have idiosyncratic 
properties; this is due to a fact about language change; very frequent words preserve their 
form even if other words around them are changing so as to become more regular. 
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Morphological Form Classes 

Irregularly Inflected Verbs 

stem 

-s' form 
-ing participle 

Past form 
-ed participle 

eat 

eats 

eating 

ate 

eaten 

catch 

catches 

catching 

caught 

caught 

cut 

cuts 

cutting 

cut 

cut 


The way these forms are used in a sentence will be discussed in Chap¬ 
ters 8-12 but is worth a brief mention here. The -s form is used in the ‘ha¬ 
bitual present’ form to distinguish the 3rd-person singular ending (She jogs 
every Tuesday) from the other choices of person and number (I/you/we/they 
jog every Tuesday). The stem form is used in the infinitive form, and also 
after certain other verbs (Td rather walk home , I want to walk home). The 
-ing participle is used when the verb is treated as a noun; this particular 
kind of nominal use of a verb is called a gerund use: Fishing is fine if you 
live near water. The -ed participle is used in the perfect construction (He's 
eaten lunch already) or the passive construction (The verdict was overturned 
yesterday.). 

In addition to noting which suffixes can be attached to which stems, 
we need to capture the fact that a number of regular spelling changes occur 
at these morpheme boundaries. For example, a single consonant letter is 
doubled before adding the -ing and -ed suffixes (beg/begging/begged). If the 
final letter is ‘c’, the doubling is spelled ‘ck’ (picnic/picnicking/picnicked). 
If the base ends in a silent -e, it is deleted before adding -ing and -ed (merge/- 
merging/merged). Just as for nouns, the -s ending is spelled -es after verb 
stems ending in -s (toss/tosses) , -z, (waltz/waltzes) -sh , (wash/washes) -ch , 
(catch/catches) and sometimes -x (tax/taxes). Also like nouns, verbs ending 
in -y preceded by a consonant change the -y to -i (try/tries). 

The English verbal system is much simpler than for example the Eu¬ 
ropean Spanish system, which has as many as fifty distinct verb forms for 
each regular verb. Figure 3.1 shows just a few of the examples for the verb 
amar , ‘to love’. Other languages can have even more forms than this Spanish 
example. 


GERUND 

PERFECT 


Derivational Morphology 

While English inflection is relatively simple compared to other languages, 
derivation in English is quite complex. Recall that derivation is the combi- 
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Present 

Indicative 

Imper. 

Imperfect 

Indicative 

Future 

Preterite 

Present 

Subjnct. 

Conditional 

Imperfect 

Subjnct. 

Future 

Subjnct. 

amo 


amaba 

amare 

ame 

ame 

amaria 

amara 

amare 

amas 

ama 

ames 

amabas 

amaras 

amaste 

ames 

amarfas 

amaras 

amares 

ama 


amaba 

amara 

amo 

ame 

amaria 

amara 

amareme 

amamos 


amabamos 

amaremos 

amamos 

amemos 

amarfamos 

amaramos 

amaremos 

amais 

amad 

amais 

amabais 

amareis 

amasteis 

ameis 

amarfais 

amarais 

amareis 

aman 


amaban 

amaran 

amaron 

amen 

amarfan 

amaran 

amaren 


Figure 3.1 To love in Spanish. 


nation of a word stem with a grammatical morpheme, usually resulting in a 
word of a different class, often with a meaning hard to predict exactly. 

A very common kind of derivation in English is the formation of new 
nouns, often from verbs or adjectives. This process is called nominalization. 
For example, the suffix -ation produces nouns from verbs ending often in the 
suffix -ize (computerize —> computerization). Here are examples of some 
particularly productive English nominalizing suffixes. 


Suffix 

Base Verb/Adjective 

Derived Noun 

-ation 

-ee 

-er 

-ness 

computerize (V) 
appoint (V) 
kill (V) 
fuzzy (A) 

computerization 

appointee 

killer 

fuzziness 


Adjectives can also be derived from nouns and verbs. Here are exam¬ 
ples of a few suffixes deriving adjectives from nouns or verbs. 


Suffix 

Base Noun/Verb 

Derived Adjective 

-al 

-able 

-less 

computation (N) 
embrace (V) 
clue (N) 

computational 

embraceable 

clueless 


Derivation in English is more complex than inflection for a number of 
reasons. One is that it is generally less productive; even a nominalizing suf¬ 
fix like - ation , which can be added to almost any verb ending in -ize, cannot 
be added to absolutely every verb. Thus we can’t say *eatation or *spella- 
tion (we use an asterisk (*) to mark ‘non-examples’ of English). Another 
is that there are subtle and complex meaning differences among nominaliz- 
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ing suffixes. For example sincerity has a subtle difference in meaning from 
sincereness. 

3.2 Finite-State Morphological Parsing 

Let’s now proceed to the problem of parsing English morphology. Consider 
a simple example: parsing just the productive nominal plural (-s) and the 
verbal progressive (-ing). Our goal will be to take input forms like those in 
the first column below and produce output forms like those in the second 
column. 


Input 

Morphological Parsed Output 

cats 

cat +N +PL 

cat 

cat +N +SG 

cities 

city +N +PL 

geese 

goose +N +PL 

goose 

(goose +N + SG) or (goose +V) 

gooses 

goose +V + 3SG 

merging 

merge +V +PRES-PART 

caught 

(catch +V +PAST-PART) or (catch +V +PAST) 


The second column contains the stem of each word as well as assorted 
morphological features. These features specify additional information about 
the stem. For example the feature +N means that the word is a noun; + SG 
means it is singular, +PL that it is plural. We will discuss features in Chap¬ 
ter 11; for now, consider +SG to be a primitive unit that means ‘singular’. 
Note that some of the input forms (like caught or goose) will be ambiguous 
between different morphological parses. 

In order to build a morphological parser, we’ll need at least the follow¬ 
ing: 

1 . a lexicon: The list of stems and affixes, together with basic information 
about them (whether a stem is a Noun stem or a Verb stem, etc). 

2. morphotactics: the model of morpheme ordering that explains which 
classes of morphemes can follow other classes of morphemes inside a 
word. For example, the rule that the English plural morpheme follows 
the noun rather than preceding it. 

3. orthographic rules: these spelling rules are used to model the changes 
that occur in a word, usually when two morphemes combine (for ex- 
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ample the y ie spelling rule discussed above that changes city + -s 
to cities rather than citys). 

The next part of this section will discuss how to represent a simple ver¬ 
sion of the lexicon just for the sub-problem of morphological recognition, 
including how to use FSAs to model morphotactic knowledge. We will then 
introduce the finite-state transducer (FST) as a way of modeling morpholog¬ 
ical features in the lexicon, and addressing morphological parsing. Finally, 
we show how to use FSTs to model orthographic rules. 


The Lexicon and Morphotactics 

A lexicon is a repository for words. The simplest possible lexicon would 
consist of an explicit list of every word of the language ( every word, i.e. 
including abbreviations (‘AAA’) and proper names (‘Jane’ or ‘Beijing’) as 
follows: 

a 

AAA 

AA 

Aachen 

aardvark 

aardwolf 

aba 

abaca 

aback 


Since it will often be inconvenient or impossible, for the various rea¬ 
sons we discussed above, to list every word in the language, computational 
lexicons are usually structured with a list of each of the stems and affixes of 
the language together with a representation of the morphotactics that tells us 
how they can fit together. There are many ways to model morphotactics; one 
of the most common is the finite-state automaton. A very simple finite-state 
model for English nominal inflection might look like Figure 3.2. 

The FSA in Figure 3.2 assumes that the lexicon includes regular nouns 
(reg-noun) that take the regular -s plural (e.g. cat, dog , fox , aardvark). 
These are the vast majority of English nouns since for now we will ignore 
the fact that the plural of words like fox have an inserted e: foxes. The 
lexicon also includes irregular noun forms that don’t take -s, both singular 
irreg-sg-noun (goose, mouse) and plural irreg-pl-noun (geese, mice). 
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reg-noun plural (-s) 



Figure 3.2 A finite-state automaton for English nominal inflection. 


reg-noun 

irreg-pl-noun 

irreg-sg-noun 

plural 

fox 

geese 

goose 

-s 

cat 

sheep 

sheep 


dog 

mice 

mouse 


aardvark 





A similar model for English verbal inflection might look like Fig¬ 
ure 3.3. 



This lexicon has three stem classes (reg-verb-stem, irreg-verb-stem, 
and irreg-past-verb-form), plus 4 more affix classes (-ed past, -ed participle, 
-ing participle, and 3rd singular -s): 
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reg-verb- 

stem 

irreg-verb- 

stem 

irreg-past- 

verb 

past 

past-part 

pres-part 

3sg 

walk 

fry 

talk 

impeach 

cut 

speak 

sing 

sang 

cut 

spoken 

caught 

ate 

eaten 

-ed 

-ed 

-ing 

-s 


English derivational morphology is significantly more complex than 
English inflectional morphology, and so automata for modeling English deriva¬ 
tion tend to be quite complex. Some models of English derivation, in fact, 
are based on the more complex context-free grammars of Chapter 9 (Sproat, 
1993; Orgun, 1995). 

As a preliminary example, though, of the kind of analysis it would 
require, we present a small part of the morphotactics of English adjectives, 
taken from Antworth (1990). Antworth offers the following data on English 
adjectives: 

big, bigger, biggest 
cool, cooler, coolest, coolly 
red, redder, reddest 

clear, clearer, clearest, clearly, unclear, unclearly 
happy, happier, happiest, happily 
unhappy, unhappier, unhappiest, unhappily 
real, unreal, really 

An initial hypothesis might be that adjectives can have an optional pre¬ 
fix ( un -), an obligatory root (big, cool , etc) and an optional suffix (-er, -est, 
or -ly). This might suggest the the FSA in Figure 3.4. 

Alas, while this FSA will recognize all the adjectives in the table above, 
it will also recognize ungrammatical forms like unbig, redly , and realest. 
We need to set up classes of roots and specify which can occur with which 
suffixes. So adj-rooti would include adjectives that can occur with un- and 
-ly (clear, happy , and real) while adj-root 2 will include adjectives that can’t 
(big, cool, and red). Antworth (1990) presents Figure 3.5 as a partial solution 
to these problems. 

This gives an idea of the complexity to be expected from English 
derivation. For a further example, we give in Figure 3.6 another fragment 
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of an FSA for English nominal and verbal derivational morphology, based 
on Sproat (1993), Bauer (1983), and Porter (1980). This FSA models a 
number of derivational facts, such as the well known generalization that any 
verb ending in -ize can be followed by the nominalizing suffix -ation (Bauer, 
1983; Sproat, 1993)). Thus since there is a word fossilize , we can predict the 
word fossilization by following states qo, q\, and Similarly, adjectives 
ending in -al or -able at q$ (equal, formal, realizable) can take the suffix -ity, 
or sometimes the suffix -ness to state q& (naturalness, casualness ). We leave 
it as an exercise for the reader (Exercise 3.2) to discover some of the indi¬ 
vidual exceptions to many of these constraints, and also to give examples of 
some of the various noun and verb classes. 

We can now use these FSAs to solve the problem of morphological 
recognition; that is, of determining whether an input string of letters makes 
up a legitimate English word or not. We do this by taking the morphotactic 
FSAs, and plugging in each ‘sub-lexicon’ into the FSA. That is, we expand 
each arc (e.g. the reg-noun-stem arc) with all the morphemes that make up 
the set of reg-noun-stem. The resulting FSA can then be defined at the level 
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varks by simply starting at the initial state, and comparing the input letter by 
letter with each word on each outgoing arc, etc., just as we saw in Chapter 2. 

Morphological Parsing with Finite-State Transducers 

Now that we’ve seen how to use FSAs to represent the lexicon and inciden¬ 
tally do morphological recognition, let’s move on to morphological parsing. 

For example, given the input cats , we’d like to output cat +N +P L, telling 
us that cat is a plural noun. We will do this via a version of two-level mor- two-level 
phology, first proposed by Koskenniemi (1983). Two level morphology rep¬ 
resents a word as a correspondence between a lexical level, which represents 
a simple concatenation of morphemes making up a word, and the surface surface 
level, which represents the actual spelling of the final word. Morphological 
parsing is implemented by building mapping rules that map letter sequences 
like cats on the surface level into morpheme and features sequences like 
cat +N +PL on the lexical level. Figure 3.8 shows these two levels for the 
word cats . Note that the lexical level has the stem for a word, followed by 
the morphological information +N +PL which tells us that cats is a plural 
noun. 


Lexical \ 

c 

a 

t 

+N 

+PL 



□ 


Surface ^_ 

c 

a 

t 

S 




Q 

Figure 3.8 Example of the lexical and surface tapes. 


The automaton that we use for performing the mapping between these 
two levels is the finite-state transducer or FST. A transducer maps between fst 
one set of symbols and another; a finite-state transducer does this via a fi¬ 
nite automaton. Thus we usually visualize an FST as a two-tape automaton 
which recognizes or generates pairs of strings. The FST thus has a more 
general function than an FSA; where an FSA defines a formal language by 
defining a set of strings, an FST defines a relation between sets of strings. 

This relates to another view of an FST; as a machine that reads one string 
and generates another, Here’s a summary of this four-fold way of thinking 
about transducers: 

• FST as recognizer: a transducer that takes a pair of strings as input 
and outputs accept if the string-pair is in the string-pair language, and 
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REGULAR 
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a reject if it is not. 

• FST as generator: a machine that outputs pairs of strings of the lan¬ 
guage. Thus the output is a yes or no, and a pair of output strings. 

• FST as translator: a machine that reads a string and outputs another 
string. 

• FST as set relater: a machine that computes relations between sets. 

An FST can be formally defined in a number of ways; we will rely 
on the following definition, based on what is called the Mealy machine 
extension to a simple FSA: 

• Q: a finite set of N states # 1 , • • • #/v 

• Z: a finite alphabet of complex symbols. Each complex symbol is 
composed of an input-output pair i : o; one symbol i from an input 
alphabet /, and one symbol o from an output alphabet 0, thus E C 
/ x 0. I and 0 may each also include the epsilon symbol £. 

• qo: the start state 

• F : the set of final states, F CQ 

• 8 (q,i : o): the transition function or transition matrix between states. 
Given a state q^Q and complex symbol i : o £ Z, 8 (#, i : 6) returns a 
new state q f E Q. 8 is thus a relation from Q x Z to Q; 

Where an FSA accepts a language stated over a finite alphabet of single 
symbols, such as the alphabet of our sheep language: 

Z = {M,!} (3.2) 

an FST accepts a language stated over pairs of symbols, as in: 

Z = {a : a, b : b, ! :!, a :!, a : £, £ :!} (3.3) 

In two-level morphology, the pairs of symbols in Z are also called feasible 
pairs. 

Where FSAs are isomorphic to regular languages, FSTs are isomor¬ 
phic to regular relations. Regular relations are sets of pairs of strings, a 
natural extension of the regular languages, which are sets of strings. Like 
FSAs and regular languages, FSTs and regular relations are closed under 
union, although in general they are not closed under difference, complemen¬ 
tation and intersection (although some useful subclasses of FSTs are closed 
under these operations; in general FSTs that are not augmented with the ^ 
are more likely to have such closure properties). Besides union, FSTs have 
two additional closure properties that turn out to be extremely useful: 
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inversion • inversion: the inversion of a transducer T ( T~ l ) simply switches the 

input and output labels. Thus if T maps from the input alphabet / to 
the output alphabet O , T~ l maps from O to /. 

• composition: if 7j is a transducer from I\ to ()\ and T 2 a transducer 
from I 2 to 0 2 , then T\ o T 2 maps from l\ to 0 2 . 

Inversion is useful because it makes it easy to convert a FST-as-parser 
into an FST-as-generator. Composition is useful because it allows us to take 
two transducers that run in series and replace them with one more complex 
transducer. Composition works as in algebra; applying T\ o T 2 to an input 
sequence S is identical to applying 7j to S and then T 2 to the result; thus 
T\ o T 2 (S) = T 2 (Ti(S)). We will see examples of composition below. 

We mentioned that for two-level morphology it’s convenient to view 
an FST as having two tapes. The upper or lexical tape, is composed from 
characters from the left side of the a : b pairs; the lower or surface tape, 
is composed of characters from the right side of the a : b pairs. Thus each 
symbol a : b in the transducer alphabet £ expresses how the symbol a from 
one tape is mapped to the symbol b on the another tape. For example a : £ 
means that an a on the upper tape will correspond to nothing on the lower 
tape. Just as for an FSA, we can write regular expressions in the complex 
alphabet £. Since it’s most common for symbols to map to themselves, in 
two-level morphology we call pairs like a : a default pairs, and just refer to 
them by the single letter a. 

We are now ready to build an FST morphological parser out of our 
earlier morphotactic FSAs and lexica by adding an extra “lexical” tape and 
the appropriate morphological features. Figure 3.9 shows an augmentation 
of Figure 3.2 with the nominal morphological features (+SG and +PL) that 
correspond to each morpheme. Note that these features map to the empty 
string £ or the word/morpheme boundary symbol # since there is no segment 
corresponding to them on the output tape. 

In order to use Figure 3.9 as a morphological noun parser, it needs to be 
augmented with all the individual regular and irregular noun stems, replacing 
the labels regular-noun-stem etc. In order to do this we need to update the 
lexicon for this transducer, so that irregular plurals like geese will parse into 
the correct stem goose +N +PL. We do this by allowing the lexicon to 
also have two levels. Since surface geese maps to underlying goose, the 
new lexical entry will be ‘g: g o:e o:e s:s e:e\ Regular forms are 
simpler; the two-level entry for fox will now be ‘f: f o:o x : x\ but by 
relying on the orthographic convention that f stands for f: f and so on, we 
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Figure 3.9 A transducer for English nominal number inflection T num . 
Since both q\ and q 2 are accepting states, regular nouns can have the plural 
suffix or not. The morpheme-boundary symbol ~ and word-boundary marker 
# will be discussed below. 


can simply refer to it as f o x and the form for geese as ‘ g o:e o:e s e \ 
Thus the lexicon will look only slightly more complex: 


reg-noun 

irreg-pl-noun 

irreg-sg-noun 

fox 

g o:e o:e s e 

goose 

cat 

sheep 

sheep 

dog 

m o:i u:e s:c e 

mouse 

aardvark 




Our proposed morphological parser needs to map from surface forms 
like geese to lexical forms like goose +N +SG. We could do this by cas¬ 
cading the lexicon above with the singular/plural automaton of Figure 3.9. 
Cascading two automata means running them in series with the output of 
the first feeding the input to the second. We would first represent the lexi¬ 
con of stems in the above table as the FST T stems of Figure 3.10. This FST 
maps e.g. dog to reg-noun-stem. In order to allow possible suffixes, T stems 
@ symbol in Figure 3.10 allows the forms to be followed by the wildcard @ symbol; 

@ : @ stands for ‘any feasible pair’. A pair of the form @ : x, for example will 
mean ‘any feasible pair which has x on the surface level’, and correspond¬ 
ingly for the form x: The output of this FST would then feed the number 
automaton T num . 

Instead of cascading the two transducers, we can compose them using 
the composition operator defined above. Composing is a way of taking a 
cascade of transducers with many different levels of inputs and outputs and 
converting them into a single ‘two-level’ transducer with one input tape and 
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one output tape. The algorithm for composition bears some resemblance to 
the algorithm for determinization of FSAs from page 49; given two automata 
T\ and T 2 with state sets Q\ and Q 2 and transition functions 8 i and 82 , we 
create a new possible state (x,y) for every pair of states x e Q\ and y e Q 2 . 
Then the new automaton has the transition function: 

83 ((x a ,y a ),i : o) = ( x b ,y b ) if 
3 c s.t. 81 {x a . i:c)= Xb 

and 82 (y a ,c:o)=y b (3.4) 

The resulting composed automaton, Tj ex = T num o T stems , is shown in 
Figure 3.11 (compare this with the FSA lexicon in Figure 3.7 on page 70). 2 
Note that the final automaton still has two levels separated by the :. Because 
the colon was reserved for these levels, we had to use the | symbol in T stems 
in Figure 3.10 to separate the upper and lower tapes. 



This transducer will map plural nouns into the stem plus the morpho¬ 
logical marker +PL, and singular nouns into the stem plus the morpheme 
+ SG. Thus a surface cats will map to cat +N +PL as follows: 

c:c a:a t:t +N:£ +PL: A s# 

That is, c maps to itself, as do a and t , while the morphological feature 
+N (recall that this means ‘noun’) maps to nothing ( 8 ), and the feature +PL 
(meaning ‘plural’) maps to ~ s. The symbol ~ indicates a morpheme bound- ■ 
ary, while the symbol # indicates a word boundary, Figure 3.12 refers to boundary 

2 Note that for the purposes of clear exposition Figure 3.11 has not been minimized in the # 
way that Figure 3.7 has. boundary 
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tapes with these morpheme boundary markers as intermediate tapes; the 
next section will show how the boundary marker is removed. 


Lexical \ 

t 

0 

X 

+N 

+PL 





Intermediate \ 

f 

0 

X 

A 

S 

# 


i 
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Figure 3.12 An example of the lexical and intermediate tapes. 


Orthographic Rules and Finite-State Transducers 

The method described in the previous section will successfully recognize 
words like aardvarks and mice. But just concatenating the morphemes won’t 
work for cases where there is a spelling change; it would incorrectly reject 
an input like foxes and accept an input like foxs. We need to deal with the 
fact that English often requires spelling changes at morpheme boundaries by 
introducing spelling rules (or orthographic rules). This section introduces 
a number of notations for writing such rules and shows how to implement 
the rules as transducers. Some of these spelling rules: 
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Name 

Description of Rule 

Example 

Consonant 

doubling 

E deletion 

E insertion 

Y replacement 
K insertion 

1-letter consonant doubled before -ingl-ed 

Silent e dropped before -ing and -ed 
e added after -s,-z,-x,-cft, -sh before -s 
-y changes to -ie before -v, -i before -ed 
verbs ending with vowel + -c add -k 

beg/begging 

make/making 

watch/watches 

try/tries 

panic/panicked 


We can think of these spelling changes as taking as input a simple 
concatenation of morphemes (the ‘intermediate output’ of the lexical trans¬ 
ducer in Figure 3.11) and producing as output a slightly-modified, (correctly- 
spelled) concatenation of morphemes. Figure 3.13 shows the three levels 
we are talking about: lexical, intermediate, and surface. So for example 
we could write an E-insertion rule that performs the mapping from the in¬ 
termediate to surface levels shown in Figure 3.13. Such a rule might say 


Lexical ^ | 

f 

0 

X 

+N 

+PL 



□ 


Intermediate ^ | 
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Figure 3.13 An example of the lexical, intermediate and surface tapes. 
Between each pair of tapes is a 2-level transducer; the lexical transducer of 
Figure 3.11 between the lexical and intermediate levels, and the E-insertion 
spelling rule between the intermediate and surface levels. The E-insertion 
spelling rule inserts an e on the surface tape when the intermediate tape has a 
morpheme boundary " followed by the morpheme -s. 


something like “insert an e on the surface tape just when the lexical tape has 
a morpheme ending in x (or z, etc) and the next morpheme is -s. Here’s a 
formalization of the rule: 



This is the rule notation of Chomsky and Halle (1968); a rule of the 
form a -A b!c _ d means ‘rewrite a as b when it occurs between c and 
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d\ Since the symbol £ means an empty transition, replacing it means in¬ 
serting something. The symbol " indicates a morpheme boundary. These 
boundaries are deleted by including the symbol ":e in the default pairs for 
the transducer; thus morpheme boundary markers are deleted on the surface 
level by default. (Recall that the colon is used to separate symbols on the in¬ 
termediate and surface forms). The # symbol is a special symbol that marks 
a word boundary. Thus (3.5) means ‘insert an e after a morpheme-final x, 
s, or z, and before the morpheme s\ Figure 3.14 shows an automaton that 
corresponds to this rule. 



The idea in building a transducer for a particular rule is to express only 
the constraints necessary for that rule, allowing any other string of symbols 
to pass through unchanged. This rule is used to insure that we can only 
see the £:e pair if we are in the proper context. So state q$, which models 
having seen only default pairs unrelated to the rule, is an accepting state, 
as is qi, which models having seen a z, s, or x. q 2 models having seen the 
morpheme boundary after the z, >v, or x, and again is an accepting state. State 
#3 models having just seen the E-insertion; it is not an accepting state, since 
the insertion is only allowed if it is followed by the s morpheme and then the 
end-of-word symbol #. 

The other symbol is used in Figure 3.14 to safely pass through any 
parts of words that don’t play a role in the E-insertion rule, other means 
‘any feasible pair that is not in this transducer’; it is thus a version of 
which is context-dependent in a transducer-by-transducer way. So for exam¬ 
ple when leaving state #o, we go to q\ on the z, >v, or x symbols, rather than 
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following the other arc and staying in q 0 . The semantics of other depends 
on what symbols are on other arcs; since # is mentioned on some arcs, it 
is (by definition) not included in other , and thus, for example, is explicitly 
mentioned on the arc from q 2 to qo- 

A transducer needs to correctly reject a string that applies the rule when 
it shouldn’t. One possible bad string would have the correct environment for 
the E-insertion, but have no insertion. State q$ is used to insure that the e 
is always inserted whenever the environment is appropriate; the transducer 
reaches qs only when it has seen an s after an appropriate morpheme bound¬ 
ary. If the machine is in state qs and the next symbol is #, the machine rejects 
the string (because there is no legal transition on # from qs). Figure 3.15 
shows the transition table for the rule which makes the illegal transitions 
explicit with the symbol. 


State \ Input 

s: s 

x: x 

z : z 

A .e 

8: e 

# 

other 

qo- 

i 

i 

i 

0 

- 

0 

0 

ft: 

i 

i 

i 

2 

- 
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0 

ft: 

5 

i 

i 

0 

3 

0 

0 

ft 

4 

- 

- 

- 

- 

- 

- 

q 4 
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- 

- 

- 

- 

0 

- 

ft 

1 

1 

1 

2 

- 

- 

0 

1 

Figure 3.15 

The state-transition table for E-insertion rule of Figure 3.14, 

extended from a similar transducer in Antworth (1990). 




The next section will show a trace of this E-insertion transducer run¬ 
ning on a sample input string. 


3.3 Combining FST Lexicon and Rules 

We are now ready to combine our lexicon and rule transducers for parsing 
and generating. Figure 3.16 shows the architecture of a two-level morphol¬ 
ogy system, whether used for parsing or generating. The lexicon transducer 
maps between the lexical level, with its stems and morphological features, 
and an intermediate level that represents a simple concatenation of mor¬ 
phemes. Then a host of transducers, each representing a single spelling rule 
constraint, all run in parallel so as to map between this intermediate level and 
the surface level. Putting all the spelling rules in parallel is a design choice; 
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we could also have chosen to run all the spelling rules in series (as a long 
cascade), if we slightly changed each rule. 
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Figure 3.16 Generating or Parsing with FST lexicon and rules 


The architecture in Figure 3.16 is a two-level cascade of transducers. 
Recall that a cascade is a set of transducers in series, in which the output 
from one transducer acts as the input to another transducer; cascades can 
be of arbitrary depth, and each level might be built out of many individual 
transducers. The cascade in Figure 3.16 has two transducers in series: the 
transducer mapping from the lexical to the intermediate levels, and the col¬ 
lection of parallel transducers mapping from the intermediate to the surface 
level. The cascade can be run top-down to generate a string, or bottom-up 
to parse it; Figure 3.17 shows a trace of the system accepting the mapping 
from fox's to foxes. 

The power of finite-state transducers is that the exact same cascade 
with the same state sequences is used when the machine is generating the 
surface tape from the lexical tape, or when it is parsing the lexical tape from 
the surface tape. For example, for generation, imagine leaving the Interme¬ 
diate and Surface tapes blank. Now if we run the lexicon transducer, given 
fox +N + PL, it will produce fox's# on the Intermediate tape via the same 
states that it accepted the Lexical and Intermediate tapes in our earlier exam¬ 
ple. If we then allow all possible orthographic transducers to run in parallel, 
we will produce the same surface tape. 

Parsing can be slightly more complicated than generation, because of 
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Figure 3.17 Accepting foxes: The lexicon transducer Ti ex from Figure 3.11 

cascaded with the E-insertion transducer in Figure 3.14. 


the problem of ambiguity. For example, foxes can also be a verb (albeit 
a rare one, meaning To baffle or confuse’), and hence the lexical parse for 
foxes could be fox +V +3SG as well as fox +N +PL. How are we to 
know which one is the proper parse? In fact, for ambiguous cases of this sort, 
the transducer is not capable of deciding. Disambiguating will require some 
external evidence such as the surrounding words. Thus foxes is likely to be 
a noun in the sequence I saw two foxes yesterday , but a verb in the sequence 
That trickster foxes me every time!. We will discuss such disambiguation 
algorithms in Chapter 8 and Chapter 17. Barring such external evidence, the 
best our transducer can do is just enumerate the possible choices; so we can 
transduce/oxGs# into both fox +V + 3SGandfox +N +PL. 

There is a kind of ambiguity that we need to handle: local ambiguity 
that occurs during the process of parsing. For example, imagine parsing the 
input verb assess . After seeing ass , our E-insertion transducer may propose 
that the e that follows is inserted by the spelling rule (for example, as far as 
the transducer is concerned, we might have been parsing the word asses). It 
is not until we don’t see the # after asses, but rather run into another s, that 
we realize we have gone down an incorrect path. 

Because of this non-determinism, FST-parsing algorithms need to in¬ 
corporate some sort of search algorithm. Exercise 3.8 asks the reader to 
modify the algorithm for non-deterministic FSA recognition in Figure 2.21 
in Chapter 2 to do FST parsing. 


AMBIGUITY 


DISAMBIGUAT¬ 

ING 
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Running a cascade, particularly one with many levels, can be unwieldy. 
Luckily, we’ve already seen how to compose a cascade of transducers in se¬ 
ries into a single more complex transducer. Transducers in parallel can be 
fN U TE°RSE T c 0N combined by automaton intersection. The automaton intersection algo- 
rithm just takes the Cartesian product of the states, i.e. for each state q-, in 
machine 1 and state qj in machine 2, we create a new state qij. Then for 
any input symbol a , if machine 1 would transition to state q n and machine 2 
would transition to state q m , we transition to state q nm . 

Figure 3.18 sketches how this intersection (A) and composition (o) pro¬ 
cess might be carried out. 
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Figure 3.18 Intersection and composition of transducers. 


Since there are a number of rule—)-FST compilers, it is almost never 
necessary in practice to write an FST by hand. Kaplan and Kay (1994) give 
the mathematics that define the mapping from rules to two-level relations, 
and Antworth (1990) gives details of the algorithms for rule compilation. 
Mohri (1997) gives algorithms for transducer minimization and determiniza- 
tion. 


3.4 Lexicon-free FSTs: The Porter Stemmer 

While building a transducer from a lexicon plus rules is the standard al¬ 
gorithm for morphological parsing, there are simpler algorithms that don’t 
require the large on-line lexicon demanded by this algorithm. These are used 
especially in Information Retrieval (IR) tasks (Chapter 17) in which a user 
needs some information, and is looking for relevant documents (perhaps on 
the web, perhaps in a digital library database). She gives the system a query 
with some important characteristics of documents she desires, and the IR 
system retrieves what it thinks are the relevant documents. One common 
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type of query is Boolean combinations of relevant keywords or phrases, e.g. 
(marsupial OR kangaroo OR koala). The system then returns documents that 
have these words in them. Since a document with the word marsupials might 
not match the keyword marsupial , some IR systems first run a stemmer on 
the keywords and on the words in the document. Since morphological pars¬ 
ing in IR is only used to help form equivalence classes, the details of the 
suffixes are irrelevant; what matters is determining that two words have the 
same stem. 

One of the most widely used such stemming algorithms is the simple 
and efficient Porter (1980) algorithm, which is based on a series of simple 
cascaded rewrite rules. Since cascaded rewrite rules are just the sort of thing 
that could be easily implemented as an FST, we think of the Porter algorithm 
as a lexicon-free FST stemmer (this idea will be developed further in the 
exercises (Exercise 3.7). The algorithm contains rules like: 

(3.6) ATIONAL —> ATE (e.g. relational —> relate) 

(3.7) ENG —> £ if stem contains vowel (e.g. motoring —> motor) 

The algorithm is presented in detail in Appendix B. 

Do stemmers really improve the performance of information retrieval 
engines? One problem is that stemmers are not perfect. For example Krovetz 
(1993) summarizes the following kinds of errors of omission and of commis¬ 
sion in the Porter algorithm: 


Errors of Commission Errors of Omission 


organization 

organ 

European 

Europe 

doing 

doe 

analysis 

analyzes 

generalization 

generic 

matrices 

matrix 

numerical 

numerous 

noise 

noisy 

policy 

police 

sparse 

sparsity 

university 

universe 

explain 

explanation 

negligible 

negligent 

urgency 

urgent 


Krovetz also gives the results of a number of experiments testing whether 
the Porter stemmer actually improved IR performance. Overall he found 
some improvement, especially with smaller documents (the larger the docu¬ 
ment, the higher the chance the keyword will occur in the exact form used 
in the query). Since any improvement is quite small, IR engines often don’t 
use stemming. 


KEYWORDS 


STEMMING 
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3.5 Human Morphological Processing 


In this section we look at psychological studies to learn how multi-morphemic 
words are represented in the minds of speakers of English. For example, con¬ 
sider the word walk and its inflected forms walks, and walked. Are all three 
in the human lexicon? Or merely walk plus as well as -ed and -si How 
about the word happy and its derived forms happily and happinessl We can 
imagine two ends of a theoretical spectrum of representations. The full list- 
full listing ing hypothesis proposes that all words of a language are listed in the mental 
lexicon without any internal morphological structure. On this view, mor¬ 
phological structure is simply an epiphenomenon, and walk, walks, walked, 
happy , and happily are all separately listed in the lexicon. This hypothesis 
is certainly untenable for morphologically complex languages like Turkish 
(Hankamer (1989) estimates Turkish as 200 billion possible words). The 
redundancy minimum redundancy hypothesis suggests that only the constituent mor¬ 
phemes are represented in the lexicon, and when processing walks , (whether 
for reading, listening, or talking) we must access both morphemes {walk and 
-s) and combine them. 

Most modem experimental evidence suggests that neither of these is 
completely true. Rather, some kinds of morphological relationships are men¬ 
tally represented (particularly inflection and certain kinds of derivation), but 
others are not, with those words being fully listed. Stanners et al. (1979), 
for example, found that derived forms {happiness, happily ) are stored sepa¬ 
rately from their stem {happy), but that regularly inflected forms (pouring ) 
are not distinct in the lexicon from their stems (pour). They did this by using 
a repetition priming experiment. In short, repetition priming takes advantage 
of the fact that a word is recognized faster if it has been seen before (if it is 
primed primed). They found that lifting primed lift, and burned primed burn, but 
for example selective didn’t prime select. Figure 3.19 sketches one possible 
representation of their finding: 
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In a more recent study, Marslen-Wilson et al (1994) found that spoken 
derived words can prime their stems, but only if the meaning of the derived 
form is closely related to the stem. For example government primes govern , 
but department does not prime depart. Grainger et al (1991) found similar 
results with prefixed words (but not with suffixed words). Marslen-Wilson 
et al (1994) represent a model compatible with their own findings as fol¬ 
lows: 



Other evidence that the human lexicon represents some morphological 
structure comes from speech errors, also called slips of the tongue. In 
normal conversation, speakers often mix up the order of the words or initial 
sounds: 

if you break it it’ll drop 

I don’t have time to work to watch television because I have to 
work 

But inflectional and derivational affixes can also appear separately from 
their stems, as these examples from Fromkin and Ratner (1998) and Garrett 
(1975) show: 

it’s not only us who have screw looses (for ‘screws loose’) 
words of rule formation (for ‘rules of word formation’) 
easy enoughly (for ‘easily enough’) 

which by itself is the most unimplausible sentence you can imagine 


The ability of these affixes to be produced separately from their stem 
suggests that the mental lexicon must contain some representation of the 
morphological structure of these words. 

In summary, these results suggest that morphology does play a role in 
the human lexicon, especially productive morphology like inflection. They 
also emphasize the important of semantic generalizations across words, and 
suggest that the human auditory lexicon (representing words in terms of their 
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sounds) and the orthographic lexicon (representing words in terms of letters) 
may have similar structures. Finally, it seems that many properties of lan¬ 
guage processing, like morphology, may apply equally (or at least similarly) 
to language comprehension and language production. 


3.6 Summary 

This chapter introduced morphology, the arena of language processing deal¬ 
ing with the subparts of words, and the finite-state transducer, the com¬ 
putational device that is commonly used to model morphology. Here’s a 
summary of the main points we covered about these ideas: 

• morphological parsing is the process of finding the constituent mor¬ 
phemes in a word (e.g. cat +N +PL for cats). 

• English mainly uses prefixes and suffixes to express inflectional and 
derivational morphology. 

• English inflectional morphology is relatively simple and includes per¬ 
son and number agreement (-s) and tense markings (-ed and ing). 

• English derivational morphology is more complex and includes suf¬ 
fixes like - ation , - ness , -able as well as prefixes like co- and re-. 

• many constraints on the English morphotactics (allowable morpheme 
sequences) can be represented by finite automata. 

• finite-state transducers are an extension of finite-state automata that 
can generate output symbols. 

• two-level morphology is the application of finite-state transducers to 
morphological representation and parsing. 

• spelling rules can be implemented as transducers. 

• there are automatic transducer-compilers that can produce a transducer 
for any simple rewrite rule. 

• the lexicon and spelling rules can be combined by composing and in¬ 
tersecting various transducers. 

• the Porter algorithm is a simple and efficient way to do stemming, 
stripping off affixes. It is not as accurate as a transducer model that in¬ 
cludes a lexicon, but may be preferable for applications like informa¬ 
tion retrieval in which exact morphological structure is not needed. 
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Bibliographical and Historical Notes 


Despite the close mathematical similarity of finite-state transducers to finite- 
state automata, the two models grew out of somewhat different traditions. 
Chapter 2 described how the finite automaton grew out of Turing’s (1936) 
model of algorithmic computation, and McCulloch and Pitts finite-state-like 
models of the neuron. The influence of the Turing machine on the trans¬ 
ducer was somewhat more indirect. Huffman (1954) proposed what was 
essentially a state-transition table to model the behavior of sequential cir¬ 
cuits, based on the work of Shannon (1938) on an algebraic model of relay 
circuits. Based on Turing and Shannon’s work, and unaware of Huffman’s 
work, Moore (1956) introduced the term finite automaton for a machine 
with a finite number of states with an alphabet of input symbols and an al¬ 
phabet of output symbols. Mealy (1955) extended and synthesized the work 
of Moore and Huffman. 

The finite automata in Moore’s original paper, and the extension by 
Mealy differed in an important way. In a Mealy machine, the input/output 
symbols are associated with the transitions between states. The finite-state 
transducers in this chapter are Mealy machines. In a Moore machine, the 
input/output symbols are associated with the state; we will see examples of 
Moore machines in Chapter 5 and Chapter 7. The two types of transduc¬ 
ers are equivalent; any Moore machine can be converted into an equivalent 
Mealy machine and vice versa. 

Many early programs for morphological parsing used an affix-stripping 
approach to parsing. For example Packard’s (1973) parser for ancient Greek 
iteratively stripped prefixes and suffixes off the input word, making note of 
them, and then looked up the remainder in a lexicon. It returned any root that 
was compatible with the stripped-off affixes. This approach is equivalent to 
the bottom-up method of parsing that we will discuss in Chapter 10. 

AMPLE (A Morphological Parser for Linguistic Exploration) (Weber 
and Mann, 1981; Weber et al , 1988; Hankamer and Black, 1991) is another 
early bottom-up morphological parser. It contains a lexicon with all possible 
surface variants of each morpheme (these are called allomorphs), together 
with constraints on their occurrence (for example in English the -es allo- 
morph of the plural morpheme can only occur after s, x, z, sh, or ch). The 
system finds every possible sequence of morphemes which match the input 
and then filters out all the sequences which have failing constraints. 



88 


Chapter 3. Morphology and Finite-State Transducers 


An alternative approach to morphological parsing is called generate- 
and-test or analysis-by-synthesis approach. Hankamer’s (1986) keCi is a 
morphological parser for Turkish which is guided by a finite-state represen¬ 
tation of Turkish morphemes. The program begins with a morpheme that 
might match the left edge of the word, and applies every possible phonolog¬ 
ical rule to it, checking each result against the input. If one of the outputs 
succeeds, the program then follows the finite-state morphotactics to the next 
morpheme and tries to continue matching the input. 

The idea of modeling spelling rules as finite-state transducers is really 
based on Johnson’s (1972) early idea that phonological rules (to be discussed 
in Chapter 4) have finite-state properties. Johnson’s insight unfortunately did 
not attract the attention of the community, and was independently discovered 
by Roland Kaplan and Martin Kay, first in an unpublished talk Kaplan and 
Kay (1981) and then finally in print (Kaplan and Kay, 1994). Kaplan and 
Kay’s work was followed up and most fully worked out by Koskenniemi 
(1983), who described finite-state morphological rules for Finnish. Kart- 
tunen (1983) built a program called KIMMO based on Koskenniemi’s mod¬ 
els. Antworth (1990) gives many details of two-level morphology and its 
application to English. Besides Koskenniemi’s work on Finnish and that of 
Antworth (1990) on English, two-level or other finite-state models of mor¬ 
phology have been worked out for many languages, such as Turkish (Oflazer, 
1993) and Arabic (Beesley, 1996). Antworth (1990) summarizes a number 
of issues in finite-state analysis of languages with morphologically complex 
processes like infixation and reduplication (for example Tagalog) and gem¬ 
ination (for example Hebrew). Karttunen (1993) is a good summary of the 
application of two-level morphology specifically to phonological rules of the 
sort we will discuss in Chapter 4. Barton et al (1987) bring up some com¬ 
putational complexity problems with two-level models, which are responded 
to by Koskenniemi and Church (1988). 

Students interested in further details of the fundamental mathematics 
of automata theory should see Hopcroft and Ullman (1979) or Lewis and 
Papadimitriou (1981). Mohri (1997) and Roche and Schabes (1997b) give 
additional algorithms and mathematical foundations for language applica¬ 
tions, including e.g. the details of the algorithm for transducer minimization. 
Sproat (1993) gives a broad general introduction to computational morphol¬ 
ogy. 
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Exercises 


3.1 Add some adjectives to the adjective FSA in Figure 3.5. 

3.2 Give examples of each of the noun and verb classes in Figure 3.6, and 
find some exceptions to the rules. 

3.3 Extend the transducer in Figure 3.14 to deal with sh and ch. 

3.4 Write a transducer(s) for the K insertion spelling rule in English. 

3.5 Write a transducer(s) for the consonant doubling spelling rule in En¬ 
glish. 


3.6 The Soundex algorithm (Odell and Russell, 1922; Knuth, 1973) is a 
method commonly used in libraries and older Census records for represent¬ 
ing people’s names. It has the advantage that versions of the names that are 
slightly misspelled or otherwise modified (common, for example, in hand¬ 
written census records) will still have the same representation as correctly- 
spelled names. (For example, Jurafsky, Jarofsky, Jarovsky, and Jarovski all 
map to J612). 

a. Keep the first letter of the name, and drop all occurrences of non-initial 
a, e, h, i, o, u, w, y 

b. Replace the remaining letters with the following numbers: 

b, f, p, v —► 1 

c, g,j, k, q, s, x, z-^2 

d, t — y 3 

1 4 

m, n — y 5 
r —» 6 

c. Replace any sequences of identical numbers with a single number (i.e. 
666 6 ) 

d. Convert to the form Letter Digit Digit Digit by dropping 
digits past the third (if necessary) or padding with trailing zeros (if 
necessary). 

The exercise: write a FST to implement the Soundex algorithm. 

3.7 Implement one of the steps of the Porter Stemmer as a transducer. 
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3.8 Write the algorithm for parsing a finite-state transducer, using the pseudo¬ 
code introduced in Chapter 2. You should do this by modifying the algorithm 
nd-recognize in Figure 2.21 in Chapter 2. 

3.9 Write a program that takes a word and, using an on-line dictionary, 
computes possible anagrams of the word. 

3.10 In Figure 3.14, why is there a z> >v, x arc from q$ to q \? 




COMPUTATIONAL 
PHONOLOGY AND 
TEXT-TO-SPEECH 


You like po-tay-to and I like po-tah-to. 

You like to-may-to and I like to-mah-to . 

Po-tay-to , po-tah-to, 

To-may-to y to-mah-to y 
Let's call the whole thing off! 

George and Ira Gershwin, Let's Call the Whole Thing Off 
from Shall We Dance , 1937 


The previous chapters have all dealt with language in text format. We now 
turn to speech. The next four chapters will introduce the fundamental in¬ 
sights and algorithms necessary to understand modem speech recognition 
and speech synthesis technology, and the related branch of linguistics called 

computational phonology. 

Let’s begin by defining these areas. The core task of automatic speech 
recognition is take an acoustic waveform as input and produce as output 
a string of words. The core task of text-to-speech synthesis is to take a 
sequence of text words and produce as output an acoustic waveform. The 
uses of speech recognition and synthesis are manifold, including automatic 
dictation/transcription, speech-based interfaces to computers and telephones, 
voice-based input and output for the disabled, and many others that will be 
discussed in greater detail in Chapter 7. 

This chapter will focus on an important part of both speech recognition 
and text-to-speech systems: how words are pronounced in terms of individ¬ 
ual speech units called phones. A speech recognition system needs to have 
a pronunciation for every word it can recognize, and a text-to-speech system 
needs to have a pronunciation for every word it can say. The first section of 
this chapter will introduce phonetic alphabets for describing pronunciation. 
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part of the field of phonetics. We then introduce articulatory phonetics, the 
study of how speech sounds are produced by articulators in the mouth. 

Modeling pronunciation would be much simpler if a given phone was 
always pronounced the same in every context. Unfortunately this is not the 
case. As we will see, the phone [t] is pronounced very differently in different 
phonetic environments. Phonology is the area of linguistics that describes 
the systematic way that sounds are differently realized in different environ¬ 
ments, and how this system of sounds is related to the rest of the grammar. 
The next section of the chapter will describe the way we write phonological 
rules to describe these different realizations. 

We next introduce an area known as computational phonology. One 
important part of computational phonology is the study of computational 
mechanisms for modeling phonological rules. We will show how the spelling- 
rule transducers of Chapter 3 can be used to model phonology. We then 
discuss computational models of phonological learning: how phonological 
rules can be automatically induced by machine learning algorithms. 

Finally, we apply the transducer-based model of phonology to an im¬ 
portant problem in text-to-speech systems: mapping from strings of letters 
to strings of phones. We first survey the issues involved in building a large 
pronunciation dictionary, and then show how the transducer-based lexicons 
and spelling rules of Chapter 3 can be augmented with pronunciations to 
map from orthography to pronunciation. 

This chapter focuses on the non-probabilistic areas of computational 
linguistics and pronunciations modeling. Chapter 5 will turn to the role of 
probabilistic models, including such areas as probabilistic models of pronun¬ 
ciation variation and probabilistic methods for learning phonological rules. 


Speech Sounds and Phonetic Transcription 

The study of the pronunciation of words is part of the field of phonetics, the 
study of the speech sounds used in the languages of the world. We will be 
modeling the pronunciation of a word as a string of symbols which represent 
phones or segments. A phone is a speech sound; we will represent phones 
with phonetic symbols that bears some resemblance to a letter in an alpha¬ 
betic language like English. So for example there is a phone represented by l 
that usually corresponds to the letter / and a phone represented by p that usu¬ 
ally corresponds to the letter p. Actually, as we will see later, phones have 
much more variation than letters do. This chapter will only briefly touch 
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on other aspects of phonetics such as prosody, which includes things like 
changes in pitch and duration. 


IPA 

ARPAbet 


IPA 

ARPAbet 

Symbol 

Symbol 

Word 

Transcription 

Transcription 

[pi 

[pi 

parsley 

['parsli] 

[p aa r s 1 iy] 

[« 

[t] 

tarragon 

['t aerogun] 

[t ae r ax g aa n] 

[k] 

[k] 

catnip 

['kactnip] 

[k ae t n ix p] 

[b] 

[b] 

bay 

[bei] 

[bey] 

[d] 

[d] 

dill 

[dil] 

[d ih 1] 

[g] 

[g] 

garlic 

['gorlik] 

[g aa r 1 ix k] 

1 m | 

[m] 

mint 

[mint] 

[m ih n t] 

[n] 

[n] 

nutmeg 

['nAtmeg] 

[n ah t m eh g 

m 

[ng] 

ginseng 

['d 3 insii)] 

[jh ih n s ix ng] 

m 

[f] 

fennel 

['fenl] 

[f eh n el] 

[v] 

[v] 

clove 

[klouv] 

[k 1 ow v] 

[0] 

[th] 

thistle 

['Orel] 

[th ih s el] 

[8] 

[dh] 

heather 

['he^a 1 ] 

[h eh dh axr] 

[S] 

[s] 

sage 

[seid 3 ] 

[seyjh] 

[Z] 

[z] 

hazelnut 

['heizlnAt] 

[h ey z el n ah t] 

HI 

[sh] 

squash 

[skwaj] 

[s k w a sh] 

[ 3 ] 

[zh] 

ambrosia 

[aem'brougo] 

[ae m b r ow zh ax] 

IHI 

[ch] 

chicory 

['tjika-i] 

[ch ih k axr iy ] 

[d 3 ] 

[jh] 

sage 

[seid 3 ] 

[seyjh] 

[1] 

[1] 

licorice 

E'likadJ] 

[1 ih k axr ix sh] 

[w] 

[w] 

kiwi 

['kiwi] 

[k iy w iy] 

[r] 

M 

parsley 

['parsli] 

[p aa r s 1 iy] 

U] 

[y] 

yew 

[yu] 

[y uw] 

[h] 

[h] 

horseradish 

['harsraedij] 

[h ao r s r ae d ih sh] 

[?] 

M 

uh-oh 

[?a?ou] 

[q ah q ow] 

[r] 

[dx] 

butter 

['bAra 1 ] 

[b ah dx axr ] 

[f] 

[nx] 

wintergreen 

wira'grinj 

[w ih nx axr grin] 

[1] 

[el] 

thistle 

['6isl] 

[th ih s el] 

Figure 4.1 IPA 

and ARPAbet symbols for transcription of English 

consonants. 





This section surveys the different phones of English, particularly Amer¬ 
ican English, showing how they are produced and how they are represented 
symbolically. We will be using two different alphabets for describing phones. 
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ipa The first is the International Phonetic Alphabet (IPA). The IPA is an evolv¬ 

ing standard originally developed by the International Phonetic Association 
in 1888 with the goal of transcribing the sounds of all human languages. 
The IPA is not just an alphabet but also a set of principles for transcription, 
which differ according to the needs of the transcription, so the same utter¬ 
ance can be transcribed in different ways all according to the principles of 
the IPA. In the interests of brevity in this book we will focus on the symbols 
that are most relevant for English; thus Figure 4.1 shows a subset of the IPA 
symbols for transcribing consonants, while Figure 4.2 shows a subset of the 
IPA symbols for transcribing vowels. 1 These tables also give the ARPAbet 
symbols; ARPAbet (?) is another phonetic alphabet, but one that is specifi¬ 
cally designed for American English and which uses ASCII symbols; it can 
be thought of as a convenient ASCII representation of an American-English 
subset of the IPA. ARPAbet symbols are often used in applications where 
non-ASCII fonts are inconvenient, such as in on-line pronunciation dictio¬ 
naries. 

Many of the IPA and ARPAbet symbols are equivalent to the Roman 
letters used in the orthography of English and many other languages. So for 
example the IPA and ARPAbet symbol [p] represents the consonant sound at 
the beginning of platypus , puma , and pachyderm , the middle of leopard , or 
the end of antelope (note that the final orthographic e of antelope does not 
correspond to any final vowel; the p is the last sound). 

The mapping between the letters of English orthography and IPA sym¬ 
bols is rarely as simple as this, however. This is because the mapping be¬ 
tween English orthography and pronunciation is quite opaque; a single letter 
can represent very different sounds in different contexts. Figure 4.3 shows 
that the English letter c is represented as IPA [k] in the word cougar , but IPA 
[s] in the word civet. Besides appearing as c and k , the sound marked as [k] 
in the IPA can appear as part of x (fox), as ck (jackal ), and as cc (raccoon). 
Many other languages, for example Spanish, are much more transparent in 
their sound-orthography mapping than English. 

The Vocal Organs 

SI We turn now to articulatory phonetics, the study of how phones are pro- 

PHONETICS 

duced, as the various organs in the mouth, throat, and nose modify the airflow 
from the lungs. 


1 For simplicity we use the symbol [r] for the American English ‘r’ sound, rather than the 
more standard IPA symbol [i]. 
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Symbol 
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IPA 

Transcription 

ARPAbet 

Transcription 

W 

[iy] 

lily 

[M] 

[1 ih 1 iy] 

[i] 

[ih] 

lily 

['bli] 

[1 ih 1 iy] 

[ei] 

[ey] 

dmsy 

['deizi] 

[d ey z i] 

M 

[eh] 

poinsettia 

[poin'serio] 

[p oy n s eh dx iy ax] 

[«] 

[ae] 

aster 

['assta 1 ] 

[ae s t axr] 


[aa] 

poppy 

E'papi] 

[p aa p i] 

M 

[ao] 

orchid 

[brkid] 

[ao r k ix d] 

M 

[uh] 

woodruff 

['wudrAf] 

[w uh d r ah f] 

[ou] 

[ow] 

lotus 

['louros] 

[1 ow dx ax s] 

M 

[uw] 

tulip 

['tulip] 

[t uw 1 ix p] 

W 

[Uh] 

buttercup 

['bAra'-.kAp] 

[b uh dx axr k uh p] 

M 

[er] 

bird 

[’byd] 

[b er d] 

[ai] 

[ay] 

iris 

['airis] 

[ay r ix s] 

[au] 

[aw] 

sunflower 

['sAnflaua 1 ] 

[s ah n f 1 aw axr] 

[01] 

[oy] 

poinsettia 

[poin'serio] 

[p oy n s eh dx iy ax] 

[ju] 

[y uw] 

feverfew 

[fivaTju] 

[f iy v axr f y u] 

M 

[ax] 

woodruff 

['wudref] 

[w uh d r ax f] 

M 

[axr] 

heather 

['heda 1 ] 

[h eh dh axr] 

[i] 

[ix] 

tulip 

[’tulip] 

[t uw 1 ix p] 

M 

[ux] 


a 

[] 

Figure 4.2 IPA and ARPAbet symbols for transcription of English vowels 


Word 

jackal 

raccoon 

cougar 

civet 

IPA 

['d 3 ae.kl] 

[rae.'kun] 

['ku.ga 1 ] 

['si.vit] 

ARPAbet 

[jh ae k el] 

[r ae k uw n] 

[k uw g axr] 

[s ih v ix t] 


Figure 4.3 The mapping between IPA symbols and letters in English or¬ 
thography is complicated; both IPA [k] and English orthographic [c] have 
many alternative realizations 


Sound is produced by the rapid movement of air. Most sounds in hu¬ 
man languages are produced by expelling air from the lungs through the 
windpipe (technically the trachea) and then out the mouth or nose. As it 
passes through the trachea, the air passes through the larynx, commonly 
known as the Adam’s apple or voicebox. The larynx contains two small 
folds of muscle, the vocal folds (often referred to non-technically as the vo¬ 
cal cords) which can be moved together or apart. The space between these 
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GLOTTIS 


VOICED 

UNVOICED 

VOICELESS 



two folds is called the glottis. If the folds are close together (but not tightly 
closed), they will vibrate as air passes through them; if they are far apart, 
they won’t vibrate. Sounds made with the vocal folds together and vibrating 
are called voiced; sounds made without this vocal cord vibration are called 
unvoiced or voiceless. Voiced sounds include [b], [d], [g], [v], [z], and all 
the English vowels, among others. Unvoiced sounds include [p], [t], [k], [f], 
[z], and others. 

The area above the trachea is called the vocal tract, and consists of the 
oral tract and the nasal tract. After the air leaves the trachea, it can exit the 
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body through the mouth or the nose. Most sounds are made by air passing 
through the mouth. Sounds made by air passing through the nose are called 
nasal sounds; nasal sounds use both the oral and nasal tracts as resonating 
cavities; English nasal sounds include m * and n , and ng. 

Phones are divided into two main classes: consonants and vowels* ggjff 
Both kinds of sounds arc formed by the motion of air through the mouth, vowels 
throat or nose. Consonants are made by restricting or blocking the airflow in 
some way, and may be voiced or unvoiced. Vowels have less obstruction, are 
usually voiced, and arc generally louder and longer-lasting than consonants. 

The technical use of these terms is much like the common usage; [p], lb), 

[t], fdh fk|, [gl, [f|, [v|, [s], fz), [r] t HI, etc., are consonants; [aa], fael, [awl, 

[ao], [ih], [aw], [o\v], [uw], etc*, arc vowels. Semivowels (such as [y] and 
[w]} have some of the properties of both; they are voiced like vowels, but 
they are short and less syllabic like consonants. 

Consonants: Place of Articulation 

Because consonants are made by restricting the airflow in some way, con¬ 
sonants can be distinguished by where this restriction is made: the point 
of maximum restriction is called the place of articulation of a consonant, place 
Places of articulation, shown in Figure 4.5, are often used in automatic 
speech recognition as a useful way of grouping phones together into equiva¬ 
lence classes: 



labial: Consonants whose main restriction is formed by the two lips 


LABIAL 
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DENTAL 


ALVEOLAR 


PALATAL 

PALATE 


VELAR 

VELUM 


GLOTTAL 


MANNER 


STOP 


coming together have a bilabial place of articulation. In English these 
include [p] as in possum, [b] as in bear, and [m] as in marmot. The En¬ 
glish labiodental consonants [vl and [f] are made by pressing the bot¬ 
tom lip against the upper row of teeth and letting the air flow through 
the space in the upper teeth. 

• dental: Sounds that are made by placing the tongue against the teeth 
are dentals. The main dentals in English are the [0] of thing or the [5] 
of though, which are made by placing the tongue behind the teeth with 
the tip slightly between the teeth. 

• alveolar: The alveolar ridge is the portion of the roof of the mouth just 
behind the upper teeth. Most speakers of American English make the 
phones [s], [z], [t], and [d] by placing the tip of the tongue against the 
alveolar ridge. 

• palatal: The roof of the mouth (the palate) rises sharply from the 
back of the alveolar ridge. The palato-alveolar sounds [J] {shrimp), 
[tj] {chinchilla), [ 3 ] {Asian), and [d%] (jaguar) are made with the blade 
of the tongue against this rising back of the alveolar ridge. The palatal 
sound [y] of yak is made by placing the front of the tongue up close to 
the palate. 

• velar: The velum or soft palate is a movable muscular flap at the very 
back of the roof of the mouth. The sounds [k] (cuckoo), [g] (goose), 
and [rj] (kingfisher) are made by pressing the back of the tongue up 
against the velum. 

• glottal: The glottal stop [?] is made by closing the glottis (by bringing 
the vocal folds together). 

Consonants: Manner of Articulation 

Consonants are also distinguished by how the restriction in airflow is made, 
for example whether there is a complete stoppage of air, or only a partial 
blockage, etc. This feature is called the manner of articulation of a conso¬ 
nant. The combination of place and manner of articulation is usually suffi¬ 
cient to uniquely identify a consonant. Here are the major manners of artic¬ 
ulation for English consonants: 

• stop: A stop is a consonant in which airflow is completely blocked 
for a short time. This blockage is followed by an explosive sound as 
the air is released. The period of blockage is called the closure and 
the explosion is called the release. English has voiced stops like [b], 
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[d], and [g] as well as unvoiced stops like [p], [t], and [k]. Stops are 
also called plosives. It is possible to use a more narrow (detailed) tran¬ 
scription style to distinctly represent the closure and release parts of 
a stop, both in ARPAbet and IPA-style transcriptions. For example 
the closure of a [p], [t], or [k] would be represented as [pci], [tel], or 
[kcl] (respectively) in the ARPAbet, and [p n ], [t n ], or [k n ] (respectively) 
in IPA style. When this form of narrow transcription is used, the un¬ 
marked ARPABET symbols [p], [t], and [k] indicate purely the release 
of the consonant. We will not be using this narrow transcription style 
in this chapter. 

• nasals: The nasal sounds [n], [m], and [rj] are made by lowering the nasals 
velum and allowing air to pass into the nasal cavity. 

• fricative: In fricatives, airflow is constricted but not cut off completely, fricative 
The turbulent airflow that results from the constriction produces a char¬ 
acteristic ‘hissing’ sound. The English labiodental fricatives [f] and [v] 

are produced by pressing the lower lip against the upper teeth, allow¬ 
ing a restricted airflow between the upper teeth. The dental fricatives 
[0] and [5] allow air to flow around the tongue between the teeth. The 
alveolar fricatives [s] and [z] are produced with the tongue against the 
alveolar ridge, forcing air over the edge of the teeth. In the palato- 
alveolar fricatives [J] and [g] the tongue is at the back of the alveolar 
ridge forcing air through a groove formed in the tongue. The higher- 
pitched fricatives (in English [s], [z], [J] and [ 3 ]) are called sibilants, sibilants 
Stops that are followed immediately by fricatives are called affricates; 
these include English [tj] (chicken) and [dg] (giraffe)). 

• approximant: In approximants, the two articulators are close together mant oxi ' 
but not close enough to cause turbulent airflow. In English [y] (yellow ), 

the tongue moves close to the roof of the mouth but not close enough 
to cause the turbulence that would characterize a fricative. In English 
[w] (wormwood), the back of the tongue comes close to the velum. 
American [r] can be formed in at least two ways; with just the tip of 
the tongue extended and close to the palate or with the whole tongue 
bunched up near the palate. [ 1 ] is formed with the tip of the tongue up 
against the alveolar ridge or the teeth, with one or both sides of the 
tongue lowered to allow air to flow over it. [1] is called a lateral sound 
because of the drop in the sides of the tongue. 

• tap: A tap or flap [r] is a quick motion of the tongue against the alve- tap 
olar ridge. The consonant in the middle of the word lotus ([louros]) is 


FLAP 
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a tap in most dialects of American English; speakers of many British 
dialects would use a [(] instead of a tap in this word. 


Vowels 

Like consonants, vowels can be characterized by the position of the articu¬ 
lators as they are made. The two most relevant parameters for vowels are 
what is called vowel height, which correlates roughly with the location of 
the highest part of the tongue, and the shape of the lips (rounded or not). 
Figure 4,6 shows the position of the tongue for different vowels. 




^ nasal tract 



Mlb j 








r f closed 




\\ f Vp velum ^ 


L 1 I . ‘ 


palate^ 

yjpQS 



lieeil | lyj 

hud | we] 

who’d | qtv] 

Figure 4.6 Positions of the tongue lor three English vowels, high front [iy]. 

low front [ae] and high back [uv]; tongue positions modeled after Ladefogcd 

(1996). 




In the vowel [i], for example, the highest point of the Longue is toward 
the front of the mouth. In the vowel [u], by contrast, the high-point of the 
tongue is located toward the back of the mouth. Vowels in which the tongue 
front is raised toward the front arc called front vowels; those in which the Longue 
back is raised toward the back are called back vowels. Note that while both [t] 
and [e] are front vowels, the tongue is higher for [i] than lor [e]. Vowels in 
high which the highest point of the tongue is comparatively high arc called high 

vowels; vowels with mid or low values of maximum tongue height arc called 
mid vowels or low vowels, respectively. 

Figure 4.7 shows a schematic character!/.alien of the vowel height of 
different vowels. It is schematic because the abstract property height only 
correlates roughly with actual tongue positions; it is in fact a more accurate 
reflection of acoustic facts. Note that the chart has two kinds of vowels: 
those in which tongue height is represented as a point and those in which it 
is represented as a vector, A vowels in which the tongue position changes 
diphthong markedly during the production of the vowel is diphthong. English is par- 
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low 


Figure 4.7 Qualities of English vowels (after Ladefoged (1993)). 


ticularly rich in diphthongs; many are written with two symbols in the IPA 
(for example the [ei] of hake or the [on] of cobra). 

The second important articulatory dimension for vowels is the shape 
of the lips. Certain vowels are pronounced with the lips rounded (the same 
lip shape used for whistling). These rounded vowels include [u], [a], and the 
diphthong [ou]. 

Syllables 

Consonants and vowels combine to make a syllable. There is no completely 
agreed-upon definition of a syllable; roughly speaking a syllable is a vowel¬ 
like sound together with some of the surrounding consonants that are most 
closely associated with it. The IPA period symbol [.] is used to separate 
syllables, so parsley and catnip have two syllables ([par.sli] and ['kaet.mp] 
respectively), tarragon has three ['tae.ro.gan], and dill has one ([dil]). A syl¬ 
lable is usually described as having an optional initial consonant or set of 
consonants called the onset, followed by a vowel or vowels, followed by a 
final consonant or sequence of consonants called the coda. Thus d is the 
onset of [dil], while 1 is the coda. The task of breaking up a word into sylla¬ 
bles is called syllabification. Although automatic syllabification algorithms 
exist, the problem is hard, partly because there is no agreed-upon definition 
of syllable boundaries. Furthermore, although it is usually clear how many 
syllables are in a word, Ladefoged (1993) points out there are some words 
(meal, teal, seal, hire, fire, hour) that can be viewed either as having one 


ROUNDED 


SYLLABLE 
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ACCENTED 


LEXICAL 

STRESS 


HOMO¬ 

GRAPHS 


syllable or two. 

In a natural sentence of American English, certain syllables are more 
prominent than others. These are called accented syllables. Accented sylla¬ 
bles may be prominent because they are louder, they are longer, they are as¬ 
sociated with a pitch movement, or any combination of the above. Since ac¬ 
cent plays important roles in meaning, understanding exactly why a speaker 
chooses to accent a particular syllable is very complex. But one important 
factor in accent is often represented in pronunciation dictionaries. This fac¬ 
tor is called lexical stress. The syllable that has lexical stress is the one that 
will be louder or longer if the word is accented. For example the word pars¬ 
ley is stressed in its first syllable, not its second. Thus if the word parsley 
is accented in a sentence, it is the first syllable that will be stronger. We 
write the symbol ['] before a syllable to indicate that it has lexical stress (e.g. 
['par.sli]). This difference in lexical stress can affect the meaning of a word. 
For example the word content can be a noun or an adjective. When pro¬ 
nounced in isolation the two senses are pronounced differently since they 
have different stressed syllables (the noun is pronounced ['kan.tent] and the 
adjective [kon.'tent]. Other pairs like this include object (noun ['ab.djekt] 
and verb [ob.'dgckt]); see Cutler (1986) for more examples. Automatic dis¬ 
ambiguation of such homographs is discussed in Chapter 17. The role of 
prosody is taken up again in Section 4.7. 


4.2 The Phoneme and Phonological Rules 


'Sense me, while I kiss the sky 

Jimi Hendrix, Purple Haze 
’Scuse me, while I kiss this guy 

Common mis-hearing of same lyrics 

All [t]s are not created equally. That is, phones are often produced 
differently in different contexts. For example, consider the different pro¬ 
nunciations of [t] in the words tunafish and starfish. The [t] of tunafish is 
aspirated. Aspiration is a period of voicelessness after a stop closure and 
before the onset of voicing of the following vowel. Since the vocal cords are 
not vibrating, aspiration sounds like a puff of air after the [t] and before the 
unaspirated vowel. By contrast, a [t] following an initial [s] is unaspirated; thus the [t] 
in starfish ([starfij]) has no period of voicelessness after the [t] closure. This 
variation in the realization of [t] is predictable: whenever a [t] begins a word 
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or unreduced syllable in English, it is aspirated. The same variation occurs 
for [k]; the [k] of sky is often mis-heard as [g] in Jimi Hendrix’s lyrics because 
[k] and [g] are both unaspirated. In a very detailed transcription system we 
could use the symbol for aspiration [ h ] after any [t] (or [k] or [p]) which be¬ 
gins a word or unreduced syllable. The word tunafish would be transcribed 
[t h un9fij] (the ARPAbet does not have a way of marking aspiration). 

There are other contextual variants of [t]. For example, when [t] occurs 
between two vowels, particularly when the first is stressed, it is pronounced 
as a tap. Recall that a tap is a voiced sound in which the top of the tongue 
is curled up and back and struck quickly against the alveolar ridge. Thus the 
word buttercup is usually pronounced [bAra^kAp]/|b uh dx axr k uh p] rather 
than [bAta'kAp]/[b uh t axr k uh p]. 

Another variant of [t] occurs before the dental consonant [0]. Here the 
[t] becomes dentalized ([t]). That is, instead of the tongue forming a closure 
against the alveolar ridge, the tongue touches the back of the teeth. 

How do we represent this relation between a [t] and its different real¬ 
izations in different contexts? We generally capture this kind of pronunci¬ 
ation variation by positing an abstract class called the phoneme, which is phoneme 
realized as different allophones in different contexts. We traditionally write allophones 
phonemes inside slashes. So in the above examples, /t/ is a phoneme whose 
allophones include [t h ], [r], and [t]. A phoneme is thus a kind of general¬ 
ization or abstraction over different phonetic realizations. Often we equate 
the phonemic and the lexical levels, thinking of the lexicon as containing 
transcriptions expressed in terms of phonemes. When we are transcribing 
the pronunciations of words we can choose to represent them at this broad 
phonemic level; such a broad transcription leaves out a lot of predictable 

NARROW 

phonetic detail. We can also choose to use a narrow transcription that transcrip- 
includes more detail, including allophonic variation, and uses the various di¬ 
acritics. Figure 4.8 summarizes a number of allophones of III; Figure 4.9 
shows a few of the most commonly used IPA diacritics. 

The relationship between a phoneme and its allophones is often cap¬ 
tured by writing a phonological rule. Here is the phonological rule for den- 
talization in the traditional notation of Chomsky and Halle (1968): 

/t/ —^ [t] / — 9 (4.1) 

In this notation, the surface allophone appears to the right of the arrow, 
and the phonetic environment is indicated by the symbols surrounding the 

underbar (_). These rules resemble the rules of two-level morphology of 

Chapter 3 but since they don’t use multiple types of rewrite arrows, this rule 
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Phone 

Environment 

Example 

IPA 

"PT 

in initial position 

toucan 

[t h uk h aen] 

[t] 

after [s] or in reduced syllables 

starfish 

[starfif] 

[?] 

word-finally or after vowel before [n] 

kitten 

[k h i?n] 

[?t] 

sometimes word-finally 

cat 

[k h 03?t] 

M 

between vowels 

buttercup 

[bAf3'-k ,l Ap] 

[tl 

before consonants or word-finally 

fruitcake 

[fruEk h eik] 

[t] 

before dental consonants ([0]) 

eighth 

[eitG] 

[] 

sometimes word-finally 

past 

[paes] 

Figure 4.8 Some allophones of it! in General American English 


is ambiguous between an obligatory or optional rule. Here is a version of the 
flapping rule: 


/{^}/^[r]/V_V 


(4.2) 


Diacritics 


Suprasegmentals 



Voiceless 

[a] 

1 

Primary stress 

[pu.mo] 

h 

Aspirated 

[P h ] 


Secondary stress 

[fouro^raef] 

, 

Syllabic 

[1] 

: 

Long 

[a:] 

- 

Nasalized 

[*] 

* 

Half long 

M 

n 

Unreleased 

[tl 


Syllable break 

[pu.mo] 

n 

Dental 

[t] 





Figure 4.9 Some of the IPA diacritics and symbols for suprasegmentals. 


4.3 Phonological Rules and Transducers 

Chapter 3 showed that spelling rules can be implemented by transducers. 
Phonological rules can be implemented as transducers in the same way; 
indeed the original work by Johnson (1972) and Kaplan and Kay (1981) 
on finite-state models was based on phonological rules rather than spelling 
rules. There are a number of different models of computational phonol¬ 
ogy that use finite automata in various ways to realize phonological rules. 
We will describe the two-level morphology of Koskenniemi (1983) used in 
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Chapter 3, but the interested reader should be aware of other recent models. 2 
While Chapter 3 gave examples of two-level rules, it did not talk about the 
motivation for these rules, and the differences between traditional ordered 
rules and two-level rules. We will begin with this comparison. 

As a first example, Figure 4.10 shows a transducer which models the 
application of the simplified flapping rule in (4.3): 

/t/-> [r] /V _V (4.3) 


V:@ 



Figure 4.10 Transducer for English Flapping: ARPAbet ‘dx’ indicates a 
flap, and the ‘other’ symbol means ‘any feasible pair not used elsewhere in the 
transducer’. ‘@’ means ‘any symbol not used elsewhere on any arc’. 


The transducer in Figure 4.10 accepts any string in which flaps occur 
in the correct places (after a stressed vowel, before an unstressed vowel), and 
rejects strings in which flapping doesn’t occur, or in which flapping occurs 
in the wrong environment. Of course the factors that flapping are actually a 
good deal more complicated, as we will see in Section 5.7. 

In a traditional phonological system, many different phonological rules 
apply between the lexical form and the surface form. Sometimes these rules 
interact; the output from one rule affects the input to another rule. One way 
to implement rule-interaction in a transducer system is to run transducers in 
a cascade. Consider, for example, the rules that are needed to deal with the 
phonological behavior of the English noun plural suffix -s. This suffix is 

2 For example Bird and Ellison’s (1994) model of the multi-tier representations of autoseg- 
mental phonology in which each phonological tier is represented by a finite-state automaton, 
and auto segmental association by the synchronization of two automata. 
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pronounced [iz] after the phones [s], [f], [z], or [ 3 ] (so peaches is pronounced 
[pitfiz], and faxes is pronounced [facksiz], [z] after voiced sounds (pigs is pro¬ 
nounced [pigz]), and [s] after unvoiced sounds (cats is pronounced [kaets]). 
We model this variation by writing phonological rules for the realization of 
the morpheme in different contexts. We first need to choose one of these 
three forms (s, z, and iz) as the ‘lexical’ pronunciation of the suffix; we 
chose z only because it turns out to simplify rule writing. Next we write two 
phonological rules. One, similar to the E-insertion spelling rule of page 77, 
inserts a [i] after a morpheme-final sibilant and before the plural morpheme 
[z]. The other makes sure that the -s suffix is properly realized as [s] after 
unvoiced consonants. 


£ —^ i / [-hsibilant] ~ _z # (4.4) 

z —y s / [-voice] A _# (4.5) 

These two rules must be ordered ; rule (4.4) must apply before (4.5). 
This is because the environment of (4.4) includes z, and the rule (4.5) changes 
z. Consider running both rules on the lexical form fox concatenated with the 
plural -s: 


Lexical form: faks'z 

(4.4) applies: faks ~ iz 

(4.5) doesn 7 apply: faks ~ iz 

If the devoicing rule (4.5) was ordered first, we would get the wrong 
result (what would this incorrect result be?). This situation, in which one 
rule destroys the environment for another, is called bleeding: 3 

Lexical form: faks A z 

(4.5) applies: faks A s 

(4.4) doesn’t apply: faks A s 

As was suggested in Chapter 3, each of these rules can be represented 
by a transducer. Since the rules are ordered, the transducers would also need 
to be ordered. For example if they are placed in a cascade, the output of the 
first transducer would feed the input of the second transducer. 

Many rules can be cascaded together this way. As Chapter 3 discussed, 
running a cascade, particularly one with many levels, can be unwieldy, and 

3 If we had chosen to represent the lexical pronunciation of as [s] rather than [z], we would 
have written the rule inversely to voice the after voiced sounds, but the rules would still 
need to be ordered; the ordering would simply flip. 
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so transducer cascades are usually replaced with a single more complex 
transducer by composing the individual transducers. 

Koskenniemi’s method of two-level morphology that was sketchily 
introduced in Chapter 3 is another way to solve the problem of rule ordering. 
Koskenniemi (1983) observed that most phonological rules in a grammar 
are independent of one another; that feeding and bleeding relations between 
rules are not the norm. 4 Since this is the case, Koskenniemi proposed that 
phonological rules be run in parallel rather than in series. The cases where 
there is rule interaction (feeding or bleeding) we deal with by slightly modi¬ 
fying some rules. Koskenniemi’s two-level rules can be thought of as a way 
of expressing declarative constraints on the well-formedness of the lexical- 
surface mapping. 

Two-level rules also differ from traditional phonological rules by ex¬ 
plicitly coding when they are obligatory or optional, by using four differing 
rule operators; the rule corresponds to traditional obligatory phonolog¬ 
ical rules, while the => rule implements optional rules: 


Rule type 

a : b <= c d 

a : b => c d 

a:b4^c_d 

a : b /<= c_d 


Interpretation 

a is always realized as b in the context c _ d 

a may be realized as b only in the context c _ d 

a must be realized as b in context c _ d and nowhere else 

a is never realized as b in the context c _ d 


The most important intuition of the two-level rules, and the mechanism 
that lets them avoiding feeding and bleeding, is their ability to represent 
constraints on two levels. This is based on the use of the colon (V), which 
was touched in very briefly in Chapter 3. The symbol a:b means a lexical 

a that maps to a surface b. Thus a:b <=> :c _ means a is realized as b 

after a surface c. By contrast a:b & c: _means that a is realized as b 

after a lexical c. As discussed in Chapter 3, the symbol c with no colon is 
equivalent to c:c that means a lexical c which maps to a surface c. 

Figure 4.11 shows an intuition for how the two-level approach avoids 
ordering for the i-insertion and z-devoicing rules. The idea is that the z- 
devoicing rule maps a lexical z-insertion to a surface s and the i rule refers 
to the lexical z: 

The two-level rules that model this constraint are shown in (4.6) and 

(4.7): 


8: i 4^ [+sibilant]: ~_z: # 


(4.6) 


4 Feeding is a situation in which one rules creates the environment for another rule and so 
must be run beforehand. 
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£j+sib]^/A zVI i ex ( ca i i eve [ 

(f-TOicef ^ 

ix s surface level 

Figure 4.11 The constraints for the i-insertion and z-devoicing rules both 
refer to a lexical z, not a surface s. 


z : s «=> [-voice]: ~ _# (4.7) 

As Chapter 3 discussed, there are compilation algorithms for creating 
automata from rules. Kaplan and Kay (1994) give the general derivation of 
these algorithms, and Antworth (1990) gives one that is specific to two-level 
rules. The automata corresponding to the two rules are shown in Figure 4.12 
and Figure 4.13. Figure 4.12 is based on Figure 3.14 of Chapter 3; see page 
78 for a reminder of how this automaton works. Note in Figure 4.12 that 
the plural morpheme is represented by z:, indicating that the constraint is 
expressed about an lexical rather than surface z. 



Figure 4.14 shows the two automata run in parallel on the input [faks A z] 
(the figure uses the ARPAbet notation [f aa k s " z]). Note that both the au¬ 
tomata assuming the default mapping to remove the morpheme boundary, 
and that both automata end in an accepting state. 
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Figure 4.14 The transducer for the i-insertion rule 4.4 and the z-devoicing 
rule 4.5 run in parallel. 


4.4 Advanced Issues in Computational Phonology 

Harmony 

Rules like flapping, i-insertion, and z-devoicing are relatively simple as pho¬ 
nological rules go. In this section we turn to the use of the two-level or finite- 
state model of phonology to model more sophisticated phenomena; this sec¬ 
tion will be easier to follow if the reader has some knowledge of phonology. 

The Yawelmani dialect of Yokuts is a Native America language spoken in 
California with a complex phonological system. In particular, there are three 
phonological rules related to the realization of vowels that had to be ordered 
in traditional phonology, and whose interaction thus demonstrates a compli¬ 
cated use of finite-state phonology. These rules were first drawn up in the 
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VOWEL 

HARMONY 


traditional Chomsky and Halle (1968) format by Kisseberth (1969) follow¬ 
ing the field work of Newman (1944). 

First, Yokuts (like many other languages including for example Turk¬ 
ish and Hungarian) has a phonological phenomenon called vowel harmony. 
Vowel harmony is a process in which a vowel changes its form to look like 
a neighboring vowel. In Yokuts, a suffix vowel changes its form to agree 
in backness and roundness with the preceding stem vowel. That is, a front 
vowel like /i/ will appear as a backvowel [u] if the stem vowel is /u/ (ex¬ 
amples are taken from Cole and Kisseberth (1995): 5 


Lexical 


Surface Gloss 


dub+hin —> dubhun Tangles, non-future’ 
xil+hin xilhin ‘leads by the hand, non-future’ 
bok’+al —> bok’ol ‘might eat’ 
xat’+al —> xat’al ‘might find’ 


This Harmony rule has another constraint: it only applies if the suffix 
vowel and the stem vowel are of the same height. Thus /u/ and /i/ are both 
high, while /o/ and /a/ are both low. 

The second relevant rule, Lowering, causes long high vowels to be¬ 
come low; thus /u:/ becomes [o:] in the first example below: 


Lexical Surface Gloss 

?u:t’+it —> ?o:t’ut ‘steal, passive aorist’ 

mi:k’+it —> me:k 5 +it ‘swallow, passive aorist’ 

The third rule, Shortening, shortens long vowels if they occur in closed 
syllables: 


Lexical Surface 

s:ap+hin —> saphin 

sudu:k+hin —> sudokhun 


The Yokuts rules must be ordered, just as the i-insertion and z-devoicing 
rules had to be ordered. Harmony must be ordered before Lowering because 
the /u:/ in the lexical form /?u:t’+it/ causes the /i/ to become [u] before it 
lowers in the surface form [?o:t ’ut]. Lowering must be ordered before Short¬ 
ening because the /u:/ in /sudu:k+hin/ lowers to [o]; if it was ordered after 
shortening it would appear on the surface as [u]. 

Goldsmith (1993) and Lakoff (1993) independently observed that the 
Yokuts data could be modeled by something like a transducer; Karttunen 

5 For purposes of simplifying the explanation, this account ignores some parts of the system 
such as vowel underspecification (Archangeli, 1984). 
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(1998) extended the argument, showing that the Goldsmith and Lakoff con¬ 
straints could be represented either as a cascade of 3 rules in series, or in 
the two-level formalism as 3 rules in parallel; Figure 4.15 shows the two 
architectures. Just as in the two-level examples presented earlier, the rules 
work by referring sometimes to the lexical context, sometimes to the surface 
context; writing the rules is left as Exercise 4.10 for the reader. 


Lexical ? UI 

* 

+ 1 h 1 ■ 1 n l $ i_ 

? 

u: l t 1 + l h 1 1 1 n l i 


i 

Rounding 

Lowering 

o 

Shortening 

1 — r — 

i 

- ^ --i 

| Rounding ! | Lowering ! {Shortening 1 } 

1 T 1 

i -4- j 

T 

Surface \ | ? | O 


h |u | n | \ i \ 

? 

othun | 


a) Cascade of rules* b) Parallel two-level rules* 


Figure 4.15 Combining the rounding, lowering, and shortening rules for 
Yawelmani Yokuts. 


Templatic Morphology 

Finite-state models of phonology/morphology have also been proposed for 
the templatic (non-concatenative) morphology (discussed on page 60) com¬ 
mon in Semitic languages like Arabic, Hebrew, and Syriac. McCarthy (1981) 
proposed that this kind of morphology could be modeled by using different 
levels of representation that Goldsmith (1976) had called tiers. Kay (1987) tiers 
proposed a computational model of these tiers via a special transducer which 
reads four tapes instead of two, as in Figure 4.16: 

The tricky part here is designing a machine which aligns the various 
strings on the tapes in the correct way; Kay proposed that the binyan tape 
could act as a sort of guide for alignment. Kay’s intuition has led to a number 
of more fully-worked-out finite-state models of Semitic morphology such as 
Beesley’s (1996) model for Arabic and Kiraz’s (1997) model for Syriac. 

The more recent work of Komai (1991) and Bird and Ellison (1994) 
showed how one-tape automata (i.e. finite-state automata rather than 4-tape 
or even 2-tape transducers) could be used to model templatic morphology 
and other kinds of phenomena that are handeled with the tier-based autoseg- 
mental representations of Goldsmith (1976). ment? 
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OPTIMALITY 

THEORY 

OT 


FAITHFUL 


lexical tape \ 

a 

k 

t 

*1 

r b \ 
J 

i 



consonantal root tape ^ 

k 

t /b 




Q 

_' 4 ’v_ !!x _ 

binyan tape ^ 

V 

C 

c 

V 

) C ] 

V 

Li 

.. 

vocalic morph, tape | 






J 


j 


Figure 4.16 A finite-state model of templatic (‘non-concatenative’) mor¬ 
phology. From Kay (1987). 


Optimality Theory 

In a traditional phonological derivation, we are given given an underlying 
lexical form and a surface form. The phonological system then consists 
of one component: a sequence of rules which map the underlying form to 
the surface form. Optimality Theory (OT) (Prince and Smolensky, 1993) 
offers an alternative way of viewing phonological derivation, based on two 
functions (GEN and EVAL) and a set of ranked violable constraints (CON). 
Given an underlying form, the GEN function produces all imaginable surface 
forms, even those which couldn’t possibly be a legal surface form for the 
input. The EYAL function then applies each constraint in CON to these 
surface forms in order of constraint rank. The surface form which best meets 
the constraints is chosen. 

A constraint in OT represents a wellformedness constraint on the sur¬ 
face form, such as a phonotactic constraint on what segments can follow each 
other, or a constraint on what syllable structures are allowed. A constraint 
can also check how faithful the surface form is to the underlying form. 

Let’s turn to our favorite complicated language, Yawelmani, for an ex¬ 
ample. 6 In addition to the interesting vowel harmony phenomena discussed 
above, Yawelmani has a phonotactic constraints that rules out sequences of 
consonants. In particular three consonants in a row (CCC) are not allowed 
to occur in a surface word. Sometimes, however, a word contains two con¬ 
secutive morphemes such that the first one ends in two consonants and the 
second one starts with one consonant (or vice versa). What does the lan- 

6 The following explication of OT via the Yawelmani example draws heavily from 
Archangeli (1997) and a lecture by Jennifer Cole at the 1999 LSA Linguistic Institute. 
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guage do to solve this problem? It turns out that Yawelmani either deletes 
one of the consonants or inserts a vowel in between. 

For example, if a stem ends in a C, and its suffix starts with CC, the 
first C of the suffix is deleted (‘+’ here means a morpheme boundary): 

C-deletion C^8/C +_C (4.8) 

Here is an example where the CCVC ‘passive consequent adjunctive’ mor¬ 
pheme hne:l (actually the underlying form is /hnil/) drops the initial C if 
the previous morpheme ends in two consonants (and an example where it 
doesn’t, for comparison): 

underlying 
morphemes gloss 

diyel-ne:l-aw ‘guard - passive consequent adjunctive - locative’ 
cawa-hne:l-aw ‘shout - passive consequent adjunctive - locative’ 

If a stem ends in CC and the suffix starts with C, the language instead 
inserts a vowel to break up the first two consonants: 

Y-insertion 8 —► Y / C_C +C (4.9) 

Here are some examples in which an i is inserted into the roots ?ilk- ‘sing’ 
and the roots logw- ‘pulverize’ only when they are followed by a C-initial 
suffix like -hin, ‘past’, not a V-initial suffix like -en, ‘future’: 

surface form gloss 
?ilik-hin ‘sang’ 

?ilken ‘will sing’ 

logiwhin ‘pulverized’ 

logwen ‘will pulverize’ 

Kisseberth (1970) suggested that it was not a coincidence that Yawel¬ 
mani had these particular two rules (and for that matter other related deletion 
rules that we haven’t presented). He noticed that these rules were function¬ 
ally related; in particular, they all are ways of avoiding 3 consonants in a row. 

Another way of stating this generalization is to talk about syllable structure. 
Yawelmani syllables are only allowed to be of the form CYC or CY (where 
C means a consonant and V means a vowel). We say that languages like 
Yawelmani don’t allow complex onsets or complex codas. From the point onset lex 
of view of syllabification, then, these insertions and deletions all happen so coda lex 
as to allow Yawelmani words to be properly syllabified. Since CVCC sylla¬ 
bles aren’t allowed on the surface, CYCC roots must be resyllabified when fiId yllabi " 
they appear on the surface. For example, here are the syllabifications of the 
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Yawelmani words we have discussed and some others; note, for example, 
that the surface syllabification of the CYCC syllables moves the final conso¬ 
nant to the beginning of the next syllable: 


underlying surface 

morphemes syllabification 

?ilk-en ?il.ken 

logw-en log. wen 

logw-hin lo.giw.hin 

xat-en xa.ten 

diyel-hnil-aw di.yel.nei.law 


gloss 


‘will sing’ 

‘will pulverize’ 

‘will pulverize’ 

‘will eat’ 

‘ask - pass. cons, adjunct. - locative’ 


Here’s where Optimality Theory comes in. The basic idea in Optimal¬ 
ity Theory is that the language has various constraints on things like syllable 
structure. These constraints generally apply to the surface form One such 
constraint, ^Complex, says ‘No complex onsets or codas’. Another class 
of constraints requires the surface form to be identical to (faithful to) the 
underlying form. Thus FAITHV says ‘Don’t delete or insert vowels’ and 
FaithC says ‘Don’t delete or insert consonants’. Given an underlying form, 
the GEN function produces all possible surface forms (i.e. every possible 
insertion and deletion of segments with every possible syllabification) and 
they are ranked by the EYAL function using these constraints. Figure 4.17 
shows the architecture. 



The EYAL function works by applying each constraint in ranked order; 
the optimal candidate is one which either violates no constraints, or violates 
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less of them than all the other candidates. This evaluation is usually shown 
on a tableau (plural tableaux). The top left-hand cell shows the input, the tableau 
constraints are listed in order of rank across the top row, and the possible 
outputs along the left-most column. Although there are an infinite number 
of candidates, it is traditional to show only the ones which arc ‘close’; in 
the tableau below wc have shown the output ?ak.pid jusL to make it clear 
that even very different surface forms are to be included. If a form violates 
a constraint, the relevant cell contains *; a !* indicates the fatal violation * 
which causes a candidate to be eliminated. Cells for constraints which arc r 
irrelevant (since a higher-level constraint is already violated) arc shaded. 


/?ilk-hin/ 

^Complex 

FaitiiC 

FaithV 

?ilk.hin 

*] 



?il.khin 

*! 



?il.hin 


*! 

?ijik.hin 



* 

?ak.pid 


*! 



One appeal of Optimality Theoretic derivations is that the constraints 
are presumed to be cross-linguistic generalizations. That is all languages are 
presumed to have some version of faithfulness, some preference for simple 
syllables, and so on. Languages differ in how they rank the constraints; thus 
English, presumably, ranks FAITHC higher than * Complex. (How do we 
know this?) 

Can a derivation in Optimality Theory be implemented by finite-state 
transducers? Frank and Satta (1999), following the foundational work of 
Ellison (1994), showed that (1) if GEN is a regular relation (for example 
assuming the input doesn’t contain context-free trees of some sort), and (2) 
if the number of allowed violations of any constraint has some finite bound, 
then an OT derivation can be computed by finite-stale means. This second 
constraint is relevant because of a property of OT that wc haven’t mentioned: 
if two candidates violate exactly the same number of constraints, the winning 
candidate is the one which has the smallest number of violations of the rele¬ 
vant constraint. 

One way to implement OT as a finite-state system was worked out by 
Kartlunen (1998), following the above-mentioned work and that of Ham¬ 
mond (1997), In Karttunen’s model, GEN is implemented as a finite-state 
transducer which is given an underlying form and produces a set of candi¬ 
date forms. For example for the syllabification example above, GEN would 
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generate all strings that are variants of the input with consonant deletions or 
vowel insertions, and their syllabifications. 

Each constraint is implemented as a filter transducer which lets pass 
only strings which meet the constraint. For legal strings, the transducer thus 
acts as the identity mapping. For example, * Complex would be imple¬ 
mented via a transducer that mapped any input string to itself, unless the 
input string had two consonants in the onset or coda, in which case it would 
be mapped to null. 

The constraints can then be placed in a cascade, in which higher-ranked 
constraints are simply run first, as suggested in Figure 4.18. 



-*-1 

GEN 

^COMPLEX 

FAITHC 

FAITHV 

- 1 - 


Figure 4.18 Version #1 (‘merciless cascade’) of Karttunen’s finite-state 
cascade implementation of OT. 


There is one crucial flaw with the cascade model in Figure 4.18. Recall 
that the constraints-transducers filter out any candidate which violates a con¬ 
straint. But in many derivations, include the proper derivation of ?i.lik.hin, 
even the optimal form still violates a constraint. The cascade in Figure 4.17 
would incorrectly filter it out, leaving no surface form at all! Frank and Satta 
(1999) and Hammond (1997) both point out that it is essential to only en¬ 
force a constraint if it does not reduce the candidate set to zero. Karttunen 
composi- (1998) formalizes this intuition with the lenient composition operator. Fe- 
nient composition is a combination of regular composition and an operation 
called priority union. The basic idea is that if any candidates meet the con¬ 
straint these candidates will be passed through the filter as usual. If no output 
meets the constraint, lenient composition retains all of the candidates. Fig¬ 
ure 4.19 shows the general idea; the interested reader should see Karttunen 
(1998) for the details. Also see Tesar (1995, 1996), Fosler (1996), and Eisner 
(1997) for discussions of other computational issues in OT. 
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-*-1 

GEN 

O L 

^COMPLEX 

Oi 

FAITHC 

Oi 

FAITHV 

\ 

/?ilk—hin/ 

GEN 

?ilk.hin ?il.khin ?il.hin ?ak.pid ?i.lik.hin 

*COMPLEX 

?il.hin ?ak.pid ?i.lik.hin 

FAITHC 

?i.lik.hin 

FAITHV 

[?i.lik.hin] 

Figure 4.19 Version #2 (‘lenient cascade’) of Karttunen’s finite-state cas¬ 
cade implementation of OT, showing a visualization of the candidate popula¬ 
tions that would be passed through each FST constraint. 


4.5 Machine Learning of Phonological Rules 


The task of a machine learning system is to automatically induce a model Earning 
for some domain, given some data from the domain and, sometimes, other 
information as well. Thus a system to learn phonological rules would be 
given at least a set of (surface forms of) words to induce from. A supervised supervised 
algorithm is one which is given the correct answers for some of this data, 
using these answers to induce a model which can generalize to new data it 
hasn’t seen before. An unsupervised algorithm does this purely from the vised per " 
data. While unsupervised algorithms don’t get to see the correct labels for 
the classifications, they can be given hints about the nature of the rules or 
models they should be forming. For example, the knowledge that the models 
will be in the form of automata is itself a kind of hint. Such hints are called 
a learning bias. bws rning 

This section gives a very brief overview of some models of unsuper¬ 
vised machine learning of phonological rules; more details about machine 
learning algorithms will be presented throughout the book. 

Ellison (1992) showed that concepts like the consonant and vowel dis¬ 
tinction, the syllable structure of a language, and harmony relationships 
could be learned by a system based on choosing the model from the set 
of potential models which is the simplest. Simplicity can be measured by 
choosing the model with the minimum coding length, or the highest proba¬ 
bility (we will define these terms in detail in Chapter 6). Daelemans et al 
(1994) used the Instance-Based Generalization algorithm (Aha et al , 1991) 
to learn stress rule for Dutch; the algorithm is a supervised one which is 
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given a number of words together with their stress patterns, and which in¬ 
duces generalizations about the mapping from the sequences of light and 
heavy syllable type in the word (light syllables have no coda consonant; 
heavy syllables have one) to the stress pattern. Tesar and Smolensky (1993) 
show that a system which is given Optimality Theory constraints but not 
their ranking can learn the ranking from data via a simple greedy algorithm. 

Johnson (1984) gives one of the first computational algorithms for 
phonological rule induction. His algorithm works for rules of the form 

(4.10) a —► bjC 

where C is the feature matrix of the segments around a. Johnson’s algorithm 
sets up a system of constraint equations which C must satisfy, by consider¬ 
ing both the positive contexts, i.e., all the contexts Q in which a b occurs on 
the surface, as well as all the negative contexts Cj in which an a occurs on 
the surface. Touretzky et al (1990) extended Johnson’s insight by using the 
version spaces algorithm of Mitchell (1981) to induce phonological rules in 
their Many Maps architecture, which is similar to two-level phonology. Like 
Johnson’s, their system looks at the underlying and surface realizations of 
single segments. For each segment, the system uses the version space algo¬ 
rithm to search for the proper statement of the context. The model also has a 
separate algorithm which handles harmonic effects by looking for multiple 
segmental changes in the same word, and is more general than Johnson’s in 
dealing with epenthesis and deletion rules. 

The algorithm of Gildea and Jurafsky (1996) was designed to induce 
transducers representing two-level rules of the type we have discussed ear¬ 
lier. Like the algorithm of Touretzky et al (1990), Gildea and Jurafsky’s 
algorithm was given sets of pairings of underlying and surface forms. The 
algorithm was based on the OSTIA (Oncina et al , 1993) algorithm, which is 
a general learning algorithm for a subtype of finite-state transducers called 
subsequential transducers. By itself, the OSTIA algorithm was too general 
to learn phonological transducers, even given a large corpus of underlying- 
form/surface-form pairs. Gildea and Jurafsky then augmented the domain- 
independent OSTIA system with three kinds of learning biases which are 
specific to natural language phonology; the main two are Faithfulness (un¬ 
derlying segments tend to be realized similarly on the surface), and Com¬ 
munity (similar segments behave similarly). The resulting system was able 
to learn transducers for flapping in American English, or German consonant 
devoicing. 

Finally, many learning algorithms for phonology are probabilistic. For 
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example Riley (1991) and Withgott and Chen (1993) proposed a decision- 
tree approach to segmental mapping. A decision tree is induced for each 
segment, classifying possible realizations of the segment in terms of contex¬ 
tual factors such as stress and the surrounding segments. Decision trees and 
probabilistic algorithms in general will be defined in Chapter 5 and Chap¬ 
ter 6. 


4.6 Mapping Text to Phones for TTS 

Dearest creature in Creation 

Studying English pronunciation 

I will teach you in my verse 

Sounds like corpse, corps, horse and worse. 

It will keep you, Susy, busy, 

Make your head with heat grow dizzy 

River, rival; tomb, bomb, comb; 

Doll and roll, and some and home. 

Stranger does not rime with anger 
Neither does devour with clangour. 

G.N. Trenite (1870-1946) The Chaos , reprinted 
in Witten (1982). 

Now that we have learned the basic inventory of phones in English and 
seen how to model phonological rules, we are ready to study the problem of 
mapping from an orthographic or text word to its pronunciation. 


Pronunciation dictionaries 

An important component of this mapping is a pronunciation dictionary. 
These dictionaries are actually used in both ASR and TTS systems, although 
because of the different needs of these two areas the contents of the dictio¬ 
naries are somewhat different. 

The simplest pronunciation dictionaries just have a list of words and 
their pronunciations: 
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Word 

Pronunciation 

Word 

Pronunciation 

cat 

[kact] 

goose 

[gus] 

cats 

[kaets] 

geese 

[gis] 

Pig 

[prg] 

hedgehog 

['hed 3 .hog] 

pigs 

[pjgz] 

hedgehogs 

['hedg.hogz] 

fox 




foxes 

['fak.siz] 




Three large, commonly-used, on-line pronunciation dictionaries in this 
format are PRONLEX, CMUdict, and CELEX. These are used for speech 
recognition and can also be adapted for use in speech synthesis. The PRON¬ 
LEX dictionary (LDC, 1995) was designed for speech recognition applica¬ 
tions and contains pronunciations for 90,694 wordforms. It covers all the 
words used in many years of the Wall Street Journal, as well as the Switch¬ 
board Corpus. The CMU Pronouncing Dictionary was also developed 
for ASR purposes and has pronunciations for about 100,000 wordforms. 
The CELEX dictionary (Celex, 1993) includes all the words in the Oxford 
Advanced Learner’s Dictionary (1974) (41,000 lemmata) and the Longman 
Dictionary of Contemporary English (1978) (53,000 lemmata), in total it has 
pronunciations for 160,595 wordforms. Its pronunciations are British while 
the other two are American. Each dictionary uses a different phone set; the 
CMU and PRONLEX phonesets are derived from the ARPAbet, while the 
CELEX dictionary is derived from the IPA. All three represent three levels 
of stress: primary stress, secondary stress, and no stress. Figure 4.20 shows 
the pronunciation of the word armadillo in all three dictionaries. 


Dictionary 

Pronunciation 

IPA Version 

Pronlex 

CMU 

CELEX 

+arm.xd’Il.o 

AA2 R M AH0 D IH1 L OWO 
”#-m@-’dI-15 

[arma'dilou] 

[armA'dilou] 

[cu.ma.'di.lau] 

Figure 4.20 The pronunciation of the word armadillo in three dictionaries. 
Rather than explain special symbols we have given an IPA equivalent for each 
pronunciation. The CMU dictionary represents unstressed vowels ([a], [i], etc.) 
by giving a 0 stress level to the vowel (we represented this by underlining in 
the IPA form). Note the British r-dropping and use of the [au] rather than [ou] 
vowel in the CELEX pronunciation. 


Often two distinct words are spelled the same (they are homographs) 
but pronounced differently. For example the verb wind (‘You need to wind 
this up more neatly’) is pronounced [waind] while the noun wind (‘blow, 
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blow, thou winter wind’) is pronounced [wind]. This is essential for TTS 
applications (since in a given context the system needs to say one or the 
other) but for some reason is usually ignored in current speech recognition 
systems. Printed pronunciation dictionaries give distinct pronunciations for 
each part of speech; CELEX does as well. Since they were designed for 
ASR, Pronlex and CMU, although they give two pronunciations for the form 
wind, don’t specify which one is used for which part of speech. 

Dictionaries often don’t include many proper names. This is a seri¬ 
ous problem for many applications; Liberman and Church (1992) report that 
21% of the word tokens in their 33 million word 1988 AP newswire cor¬ 
pus were names. Furthermore, they report that a list obtained in 1987 from 
the Donnelly marketing organization contains 1.5 million names (covering 
72 million households in the United States). But only about 1000 of the 
52477 lemmas in CELEX (which is based on traditional dictionaries) are 
proper names. By contrast Pronlex includes 20,000 names; this is still only 
a small fraction of the 1.5 million. Very few dictionaries give pronunciations 
for entries like Dr ., which as Liberman and Church (1992) point out can be 
“doctor” or “drive”, or 2/3 , which can be “two thirds” or “February third” or 
“two slash three”. 

No dictionaries currently have good models for the pronunciation of 
function words (and, I, a, the, of, etc). This is because the variation in these 
words due to phonetic context is so great. Usually the dictionaries include 
some simple baseform (such as [5i] for the and use other algorithms to de¬ 
rive the variation due to context; Chapter 5 will treat the issue of modeling 
contextual pronunciation variation for words of this sort. 

One significant difference between TTS and ASR dictionaries is that 
TTS dictionaries do not have to represent dialectal variation; thus where 
a very accurate ASR dictionary needs to represent both pronunciations of 
either and tomato, a TTS dictionary can choose one. 

Beyond Dictionary Lookup: Text Analysis 

Mapping from text to phones relies on the kind of pronunciation dictionaries 
we talked about in the last section. As we suggested before, one way to map 
text-to-phones would be to look up each word in a pronunciation dictionary 
and read the string of phones out of the dictionary. This method would work 
fine for any word that we can put in the dictionary in advance. But as we 
saw in Chapter 3, it’s not possible to represent every word in English (or any 
other language) in advance. Both speech synthesis and speech recognition 
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systems need to be able to guess at the pronunciation of words that are not 
in their dictionary. This section will first examine the kinds of words that 
are likely to be missing in a pronunciation dictionary, and then show how 
the finite-state transducers of Chapter 3 can be used to model the basic task 
of text-to-phones. Chapter 5 will introduce variation in pronunciation and 
introduce probabilistic techniques for modeling it. 

Three of the most important cases where we cannot rely on a word 
dictionary involve names, morphological productivity, and numbers. As 
a brief example, we arbitrarily selected a brief (561 word) movie review that 
appeared in today’s issue of the New York Times. The review, of Vincent 
Gallo’s ’’Buffalo ’66”, was written by Janet Maslin. Here’s the beginning of 
the article: 

In Vincent Gallo’s “Buffalo ’66,” Billy Brown (Gallo) steals a 
blond kewpie doll named Layla (Christina Ricci) out of her tap 
dancing class and browbeats her into masquerading as his wife at 
a dinner with his parents. Billy hectors, cajoles and tries to bribe 
Layla. (“You can eat all the food you want. Just make me look 
good”) He threatens both that he will kill her and that he won’t 
be her best friend. He bullies her outrageously but with such 
crazy brio and jittery persistence that Layla falls for him. Gallo’s 
film, a deadpan original mixing pathos with bravado, works on 
its audience in much the same way. 

We then took two large commonly-used on-line pronunciation dictionaries; 
the PRONLEX dictionary, that contains pronunciations for 90,694 word- 
forms and includes coverage of many years of the Wall Street Journal, as 
well as the Switchboard Corpus, and the larger CELEX dictionary, which 
has pronunciations for 160,595 wordforms. The combined dictionaries have 
approximately 194,000 pronunciations. Of the 561 words in the movie re¬ 
view, 16 (3%) did not have pronunciations in these two dictionaries (not 
counting two hyphenated words, baby-blue and hollow-eyed). Here they 
are: 

Names Inflected Names Numbers Other 


Aki Gazzara Gallo’s 

’66 

c’mere 

Anjelica Kaurismaki 


indie 

Arquette Kusturica 


kewpie 

Buscemi Layla 


sexpot 

Gallo Rosanna 




Some of these missing words can be found by increasing the dictionary 
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size (for example Wells’s (1990) definitive (but not on-line) pronunciation 
dictionary of English does have sexpot and kewpie). But the rest need to 
generated on-line. 

Names are a large problem for pronunciation dictionaries. It is diffi¬ 
cult or impossible to list in advance all proper names in English; furthermore 
they may come from any language, and may have variable spellings. Most 
potential applications for TTS or ASR involve names; for example names 
are essentially in telephony applications (directory assistance, call routing). 
Corporate names are important in many applications and are created con¬ 
stantly ( CoComp , Intel , Cisco). Medical speech applications (such as tran¬ 
scriptions of doctor-patient interviews) require pronunciations of names of 
pharmaceuticals; there are some off-line medical pronunciation dictionaries 
but they are known to be extremely inaccurate (Markey and Ward, 1997). 
Recall the figure of 1.5 million names mentioned above, and Liberman and 
Church’s (1992) finding that 21% of the word tokens in their 33 million word 
1988 AP news wire corpus were names. 

Morphology is a particular problem for many languages other than En¬ 
glish. For languages with very productive morphology it is computationally 
infeasible to represent every possible word; recall this Turkish example: 

(4.11) uygaiiastiramadiklanmizdanmissinizcasina 

uygar +la§ +tir +ama +dik +lar +imiz 

civilized +BEC +CAUS +NegAble +PPART +PL +p1pl 
+dan +mi§ +siniz +casina 
+ABL +PAST +2PL +Aslf 

‘(behaving) as if you are among those whom we could not 
civilize/cause to become civilized’ 

Even a language as similar to English as German has greater ability to 
create words; Sproat etal (1998) note the spontaneously created German ex¬ 
ample Unerfindlichkeitsunterstellung (‘allegation of incomprehensibility’). 

But even in English, morphologically simple though it is, morphologi¬ 
cal knowledge is necessary for pronunciation modeling. For example names 
and acronyms are often inflected ( Gallo's , IBM's , DATs , Syntex’s) as are 
new words (faxes , indies). Furthermore, we can’t just ‘add s’ on to the pro¬ 
nunciation of the uninflected forms, because as the last section showed, the 
possessive - ’s and plural -s suffix in English are pronounced differently in 
different contexts; Syntex’s is pronounced [sinteksiz], faxes is pronounced 
[fasksiz], IBM’s is pronounced [aibijemz], and DATs is pronounced [daets]. 



124 


Chapter 4. Computational Phonology and Text-to-Speech 


Finally, pronouncing numbers is a particularly difficult problem. The 
’66 in Buffalo ’66 is pronounced [sikstisiks] not [srkssiks]. The most natural 
way to pronounce the phone number ‘947-2020’ is probably ‘nine’-‘four’- 
‘seven’ - ‘twenty’ - ‘twenty’ rather than ‘nine’ - ‘four’ - ‘seven’ - ‘two’ - ‘zero’ - ‘two’ - 
‘zero’. Liberman and Church (1992) note that there are five main ways to 
pronounce a string of digits (although others are possible): 

• Serial: each digit is pronounced separately — 8765 is “eight seven six 
five” 

• Combined: the digit string is pronounced as a single integer, with all 
position labels read out — “eight thousand seven hundred sixty five” 

• Paired: each pair of digits is pronounced as an integer; if there is an 
odd number of digits the first one is pronounced by itself — “eighty- 
seven sixty-five”. 

• Hundreds: strings of four digits can be pronounced as counts of hun¬ 
dreds — “eighty-seven hundred (and) sixty-five” 

• Trailing Unit: strings than end in zeros are pronounced serially until 
the last nonzero digit, which is pronounced followed by the appropriate 
unit — 8765000 is “eight seven six five thousand”. 

Pronunciation of numbers and these five methods are discussed further 
in Exercises 4.5 and 4.6. 

An FST-based pronunciation lexicon 

Early work in pronunciation modeling for text-to-speech systems (such as 
the seminal MITalk system Allen et al (1987)) relied heavily on letter-to- 
so!jn e d" to " sound rules. Each rule specified how a letter or combination of letters was 
mapped to phones; here is a fragment of such a rule-base from Witten (1982): 

Fragment Pronunciation 


-p- 

[p] 

-ph- 

[f] 

-phe- 

[fi] 

-phes- 

[fiz] 

-place- 

[pleis] 

-placi- 

[pleisi] 

-plement- 

[pliment] 


Such systems consisted of a long list of such rules and a very small dic¬ 
tionary of exceptions (often function words such as a, are, as, both, do, does , 
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etc.). More recent systems have completely inverted the algorithm, relying 
on very large dictionaries, with letter-to-sound rules only used for the small 
number of words that are neither in the dictionary nor are morphological 
variants of words in the dictionary. How can these large dictionaries be rep¬ 
resented in a way that allows for morphological productivity? Luckily, these 
morphological issues in pronunciation (adding inflectional suffixes, slight 
pronunciation changes at the juncture of two morphemes, etc) are identical 
to the morphological issues in spelling that we saw in Chapter 3. Indeed, 
(Sproat, 1998b) and colleagues have worked out the use of transducers for 
text-to-speech. We might break down their transducer approach into five 
components: 

1. an FST to represent the pronunciation of individual words and mor¬ 
phemes in the lexicon 

2. FSAs to represent the possible sequencing of morphemes 

3. individual FSTs for each pronunciation rule (for example expressing 
the pronunciation of -s in different contexts 

4. heuristics and letter-to-sound (LTS) rules/transducers used to model 
the pronunciations of names and acronyms 

5. default letter-to-sound rules/transducers for any other unknown words 

We will limit our discussion here to the first four components; those 
interested in letter-to-sound rules should see (Allen et al , 1987). These first 
components will turn out to be simple extensions of the FST components 
we saw in Chapter 3 and on page 109. The first is the representation of the 
lexical base form of each word; recall that ‘base’ form means the uninflected 
form of the word. The previous base forms were stored in orthographic 
representation; we will need to augment each of them with the correct lexical 
phonological representation. Figure 4.21 shows the original and the updated 
lexical entries: 

The second part of our FST system is the finite state machinery to 
model morphology. We will give only one example: the nominal plural suf¬ 
fix -s. Figure 4.22 in Chapter 3 shows the automaton for English plurals, 
updated to handle pronunciation as well. The only change was the addi¬ 
tion of the [s] pronunciation for the suffix, and £ pronunciations for all the 
morphological features. 

We can compose the inflection FSA in Figure 4.22 with a transducer 
implementing the baseform lexicon in Figure 4.21 to produce an inflectionally- 
enriched lexicon that has singular and plural nouns. The resulting mini¬ 
lexicon is shown in Figure 4.23. 
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Orthographic Lexicon 

Lexicon 

Regular Nouns 

cat 

c k a ae t|t 

fox 

f f o a x ks 

dog 

d d o a g g 

Irregular Singular Nouns 

goose 

g g oo u s s e & 

Irregular Plural Nouns 

g o:e o:e s e 

g g oo u:ee i s s e 8 


Figure 4.21 FST-based lexicon, extending the lexicon in the table on page 
74 in Chapter 3. Each symbol in the lexicon is now a pair of symbols separated 
by ’I’, one representing the ‘orthographic’ lexical entry and one the ‘phono¬ 
logical’ lexical entry. The irregular plural geese also pre-specifies the contents 
of the intermediate tape ‘:ee|i’. 



Figure 4.22 FST for the nominal singular and plural inflection. The au¬ 
tomaton adds the morphological features [+N], [+PL], and [+SG] at the lexi¬ 
cal level where relevant, and also adds the plural suffix s\z (at the intermediate 
level). We will discuss below why we represent the pronunciation of -s as z 
rather than s. 


The lexicon shown in Figure 4.23 has two levels, an underlying or 
‘lexical’ level and an intermediate level. The only thing that remains is to add 
transducers which apply spelling rules and pronunciation rules to map the 
intermediate level into the surface level. These include the various spelling 
rules discussed on page 76 and the pronunciation rules starting on page 104. 

The lexicon and these phonological rules and the orthographic rules 
from Chapter 3 can now be used to map between a lexical representation 
(containing both orthographic and phonological strings) and a surface rep¬ 
resentation (containing both orthographic and phonological strings). As we 
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saw in Chapter 3, this mapping can be run from surface to lexical form, or 
from lexical to surface form; Figure 4.24 shows the architecture. Recall that 
the lexicon FST maps between the ‘lexical’ level, with its stems and mor¬ 
phological features, and an ‘intermediate’ level which represents a simple 
concatenation of morphemes. Then a host of FSTs, each representing either 
a single spelling rule constraint or a single phonological constraint, all run 
in parallel so as to map between this intermediate level and the surface level. 
Each level has both orthographic and phonological representations. For text- 
to-speech applications in which the input is a lexical form (for example for 
text generation, where the system knows the lexical identity of the word, its 
part of speech, its inflection, etc), the cascade of FSTs can map from lexi¬ 
cal form to surface pronunciation. For text-to-speech applications in which 
the input is a surface spelling (for example for ‘reading text out loud’ ap¬ 
plications), the cascade of FSTs can map from surface orthographic form to 
surface pronunciation via the underlying lexical form. 

Finally let us say a few words about names and acronyms. Acronyms 
can be spelled with or without periods (I.R.S. or IRS. Acronyms with pe¬ 
riods are usually pronounced by spelling them out ([aiares]). Acronyms 
that usually appear without periods (AIDS, ANSI, ASCAP) may either be 
spelled out or pronounced as a word; so AIDS is usually pronounced the 
same as the third-person form of the verb aid. Liberman and Church (1992) 
suggest keeping a small dictionary of the acronyms that are pronounced as 
words, and spelling out the rest. Their method for dealing with names begins 
with a dictionary of the pronunciations of 50,000 names, and then applies a 
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Figure 4.24 Mapping between the lexicon and surface form for orthogra¬ 
phy and phonology simultaneously. The system can be used to map from a 
lexical entry to its surface pronunciation or from surface orthography to sur¬ 
face pronunciation via the lexical entry. 


small number of affix-stripping rules (akin to the Porter Stemmer of Chap¬ 
ter 3), rhyming heuristics, and letter-to-sound rules to increase the coverage. 
Liberman and Church (1992) took the most frequent quarter million words 
in the Donnelly list. They found that the 50,000 word dictionary covered 
59% of these 250,000 name tokens. Adding stress-neutral suffixes like -s, 
-ville , and -son (Walters = Walter + s, Abelson = Abel + son, Lucasville 
= Lucas + ville) increased the coverage to 84%. Adding name-name com¬ 
pounds ( Abdulhussein , Baumgaertner ) and rhyming heuristics increased the 
coverage to 89%. (The rhyming heuristics used letter-to-sound rules for the 
beginning of the word and then found a rhyming word to help pronounce the 
end; so Plotsky was pronounced by using the LTS rule for PI- and guessing 
-otsky from Trotsky They then added a number of more complicated morpho¬ 
logical rules (prefixes like O'Brien ), stress-changing suffixes ( Adamovich ), 
suffix-exchanges (Bierstadt - Bierbaum - baum + stadt) and used a system 
of letter-to-sound rules for the remainder. This system was not implemented 
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as an FST; Exercise 4.11 will address some of the issues in turning such a 
set of rules into an FST. Readers interested in further details about names, 
acronyms and other unknown words should consult sources such as Liber¬ 
man and Church (1992), Vitale (1991), and Allen et al. (1987). 


4.7 Prosody in TTS 

The orthography to phone transduction process just described produces the 
main component for the input to the part of a TTS system which actually 
generates the speech. Another important part of the input is a specification 
of the prosody. The term prosody is generally used to refer to aspects of a prosody 
sentence’s pronunciation which aren’t described by the sequence of phones 
derived from the lexicon. Prosody operates on longer linguistic units than 
phones, and hence is sometimes called the study of suprasegmental phe- mental eg ' 
nomena. 

Phonological Aspects of Prosody 

There are three main phonological aspects to prosody: prominence, struc- prominence 
ture and tune. structure 

As 102 discussed, prominence is a broad term used to cover stress tune 
and accent. Prominence is a property of syllables, and is often described in stress 
a relative manner, by saying one syllable is more prominent than another, accent 
Pronunciation lexicons mark lexical stress; for example table has its stress 
on the first syllable, while machine has its stress on the second. Function 
words like there, the or a are usually unaccented altogether. When words are 
joined together, their accentual patterns combine and form a larger accent 
pattern for the whole utterance. There are some regularities in how accents 
combine. For example adjective-noun combinations like like new truck are 
likely to have accent on the right word (new *truck, while noun-noun com¬ 
pounds like Hree surgeon are likely to have accent on the left. In generally, 
however, there are many exceptions to these rules, and so accent prediction 
is quite complex. For example the noun-noun compound *apple cake has the 
accent on the first word while the noun-noun compound apple *pie or city 
*hall both have the accent on the second word (Liberman and Sproat, 1992; 

Sproat, 1994, 1998a). Furthermore, rhythm plays a role in keeping the ac¬ 
cented syllables spread apart a bit; thus city *hall and Sparking lot combine 
as *city hall Sparking lot (Liberman and Prince, 1977). Finally, the location 
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of accent is very strongly affected by the discourse factors we will describe 
in Chapter 18 and Chapter 19; in particular new or focused words or phrases 
often receive accent. 

Sentences have prosodic structure in the sense that some words seem to 
group naturally together and some words seem to have a noticeable break or 
disjuncture between them. Often prosodic structure is described in terms of 
prosodic phrasing, meaning that an utterance has a prosodic phrase struc¬ 
ture in a similar way to it having a syntactic phrase structure. For example, in 
the sentence I wanted to go to London, but could only get tickets for France 
there seems to be two main prosodic phrases, their boundary occurring at the 
comma. Commonly used terms for these larger prosodic units include into- 
national phrase or IP (Beckman and Pierrehumbert, 1986), intonation unit 
(Du Bois et al, 1983), and tone unit (Crystal, 1969). Furthermore, in the 
first phrase, there seems to be another set of lesser prosodic phrase bound¬ 
aries (often called intermediate phrases) that split up the words as follows 
I wanted \ to go \ to London. The exact definitions of prosodic phrases 
and subphrases and their relation to syntactic phrases like clauses and noun 
phrases and semantic units have been and still are the topic of much debate 
(Chomsky and Halle, 1968; Langendoen, 1975; Streeter, 1978; Hirschberg 
and Pierrehumbert, 1986; Selkirk, 1986; Nespor and Vogel, 1986; Croft, 
1995; Ladd, 1996; Ford and Thompson, 1996; Ford et al. 9 1996). Despite 
these complications, algorithms have been proposed which attempt to au¬ 
tomatically break an input text sentence into intonational phrases. For ex¬ 
ample Wang and Hirschberg (1992), Ostendorf and Veilleux (1994), Tay¬ 
lor and Black (1998), and others have built statistical models (incorporating 
probabilistic predictors such as the CART-style decision trees to be defined 
in Chapter 5) for predicting intonational phrase boundaries based on such 
features as the parts of speech of the surrounding words, the length of the 
utterance in words and seconds, the distance of the potential boundary from 
the beginning or ending of the utterance, and whether the surrounding words 
are accented. 

Two utterances with the same prominence and phrasing patterns can 
still differ prosodically by having different tunes. Tune refers to the into¬ 
national melody of an utterance. Consider the utterance oh, really. Without 
varying the phrasing or stress, it is still possible to have many variants of 
this by varying the intonational tune. For example, we might have an ex¬ 
cited version oh, really! (in the context of a reply to a statement that you’ve 
just won the lottery); a sceptical version oh, really? — in the context of not 
being sure that the speaker is being honest; to an angry oh, really! indicat- 
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ing displeasure. Intonational tunes can be broken into component parts, the 
most important of which is the pitch accent. Pitch accents occur on stressed £5? E H NT 
syllables and form a characteristic pattern in the FO contour (as explained be¬ 
low). Depending on the type of pattern, different effects (such as those just 
outlined above) can be produced. A popular model of pitch accent classifi¬ 
cation is the Pierrehumbert or ToBI model (Pierrehumbert, 1980; Silverman 
et al , 1992), which says there are 5 pitch accents in English, which are made 
from combining two simple tones (high H, and low L) in various ways. A 
H+L pattern forms a fall, while a L+H pattern forms a rise. An asterisk (*) 
is also used to indicate which tone falls on the stressed syllable. This gives 
an inventory of H*, L*, L+H*, L*+H, H+L* (a sixth pitch accent H*+L 
which was present in early versions of the model was later abandoned). Our 
three examples of oh, really might be marked with the accents L+H*, L*+H 
and L* respectively. In addition to pitch accents, this model also has two 
phrase accents L- and H- and two boundary tones L% and H%, which are 
used at the ends of phrases to control whether the intonational tune rises or 
falls. 

Other intonational modals differ from ToBI by not using discrete phone¬ 
mic classes for intonation accents. For example the Tilt (Taylor, 2000) and 
Fujisaki models (Fujisaki and Ohno, 1997) use continuous parameters rather 
than discrete categories to model pitch accents. These researchers argue that 
while the discrete models are often easier to visualize and work with, con¬ 
tinuous models may be more robust and more accurate for computational 
purposes. 

Phonetic or Acoustic Aspects of Prosody 

The three phonological factors interact and are realized by a number of dif¬ 
ferent phonetic or acoustic phenomena. Prominent syllables are generally 
louder and longer that non-prominent syllables. Prosodic phrase boundaries 
are often accompanied by pauses, by lengthening of the syllable just before 
the boundary, and sometimes lowering of pitch at the boundary. Intonational 
tune is manifested in the fundamental frequency (FO) contour. 

Prosody in Speech Synthesis 

A major task for a TTS system is to generate appropriate linguistic repre¬ 
sentations of prosody, and from them generate appropriate acoustic patterns 
which will be manifested in the output speech waveform. The output of 
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a TTS system with such a prosodic component is a sequence of phones, 
each of which has a duration and an FO (pitch) value. The duration of each 
phone is dependent on the phonetic context (see Chapter 7). The FO value 
is influenced by the factors discussed above, including the lexical stress, the 
accented or focused element in the sentence, and the intonational tune of the 
utterance (for example a final rise for questions). Figure 4.25 shows some 
sample TTS output from the FESTIVAL (Black et al ., 1999) speech synthe¬ 
sis system for the sentence Do you really want to see all of it?. This output, 
together with the FO values shown in Figure 4.26 would be the input to the 
waveform synthesis component described in Chapter 7. The durations here 
are computed by a CART-style decision tree (Riley, 1992). 
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Figure 4.25 Output of the FESTIVAL (Black et al , 1999) generator for the sentence Do 
you really want to see all of it?. The exact intonation countour is shown in Figure 4.26. 



As was suggested above, determining the proper prosodic pattern for 
a sentence is difficult, as real-world knowledge and semantic information is 
needed to know which syllables to accent, and which tune to apply. This sort 
of information is difficult to extract from the text and hence prosody modules 
often aim to produce a “neutral declarative” version of the input text, which 
assume the sentence should be spoken in a default way with no reference to 
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discourse history or real-world events. This is one of the main reasons why 
intonation in TTS often sounds “wooden”. 


4.8 Human Processing of Phonology and Morphology 

Chapter 3 suggested that productive morphology plays a psychologically real 
role in the human lexicon. But we stopped short of a detailed model of how 
the morphology might be represented. Now that we have studied phono¬ 
logical structure and phonological learning, we return to the psychological 
question of the representation of morphological/phonological knowledge. 

One view of human morphological or phonological processing might 
be that it distinguishes productive, regular morphology from irregular or ex¬ 
ceptional morphology. Under this view, the regular past tense morpheme 
- ed , for example, could be mentally represented as a rule which would be 
applied to verbs like walk to produce walked. Irregular past tense verbs like 
broke , sang , and brought , on the other hand, would simply be stored as part 
of a lexical representation, and the rule wouldn’t apply to these. Thus this 
proposal strongly distinguishes representation via rules from representation 
via lexical listing. 

This proposal seems sensible, and is indeed identical to the transducer- 
based models we have presented in these last two chapters. Unfortunately, 
this simple model seems to be wrong, One problem is that the irregular verbs 
themselves show a good deal of phonological subregularity. For example, larity gu ' 
the x/ae alternation relating ring and rang also relates sing and sang and swim 
and swam (Bybee and Slobin, 1982). Children learning the language of¬ 
ten extend this pattern to incorrectly produce bring-brang , and adults often 
make speech errors showing effects of this subregular pattern. A second 
problem is that there is psychological evidence that high-frequency regular 
inflected forms ( needed , covered) are stored in the lexicon just like the stems 
cover and need (Losiewicz, 1992). Finally, word and morpheme frequency 
in general seems to play an important role in human processing. 

Arguments like these led to ‘data-driven’ models of morphological 
learning and representation, which essentially store all the inflected forms 
they have seen. These models generalize to new forms by a kind of analogy; 
regular morphology is just like subregular morphology but acquires rule-like 
trappings simply because it occurs more often. Such models include the 
computational connectionist or Parallel Distributed Processing model of tionist 

PARALLEL 

Rumelhart and McClelland (1986) and subsequent improvements (Plunkett distributed 
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and Marchman, 1991; MacWhinney and Leinbach, 1991) and the similar 
network model of Bybee (1985, 1995). In these models, the behavior of 
regular morphemes like -ed emerges from its frequent interaction with other 
forms. Proponents of the rule-based view of morphology such as Pinker 
and Prince (1988), Marcus et al (1995), and others, have criticized the con- 
nectionist models and proposed a compromise dual processing model, in 
which regular forms like -ed are represent as symbolic rules, but subregular 
examples (broke, brought ) are represented by connectionist-style pattern as- 
sociators. This debate between the connectionist and dual processing models 
has deep implications for mental representation of all kinds of regular rule- 
based behavior and is one of the most interesting open questions in human 
language processing. Chapter 7 will briefly discuss connectionist models of 
human speech processing; readers who are further interested in connection¬ 
ist models should consult the references above and textbooks like Anderson 
(1995). 

4.9 Summary 

This chapter has introduced many of the important notions we need to un¬ 
derstand spoken language processing. The main points are as follows: 

• We can represent the pronunciation of words in terms of units called 
phones. The standard system for representing phones is the Interna¬ 
tional Phonetic Alphabet or IPA. An alternative English-only tran¬ 
scription system that uses ASCII letters is the ARPAbet. 

• Phones can be described by how they are produced articulatorily by 
the vocal organs; consonants are defined in terms of their place and 
manner of articulation and voicing, vowels by their height and back- 
ness. 

• A phoneme is a generalization or abstraction over different phonetic 
realizations. Allophonic rules express how a phoneme is realized in a 
given context. 

• Transducers can be used to model phonological rules just as they were 
used in Chapter 3 to model spelling rules. Two-level morphology is 
a theory of morphology/phonology which models phonological rules 
as finite-state well-formedness constraints on the mapping between 
lexical and surface form. 
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• Pronunciation dictionaries are used for both text-to-speech and au¬ 
tomatic speech recognition. They give the pronunciation of words as 
strings of phones, sometimes including syllabification and stress. Most 
on-line pronunciation dictionaries have on the order of 100,000 words 
but still lack many names, acronyms, and inflected forms. 

• The text-analysis component of a text-to-speech system maps from 
orthography to strings of phones. This is usually done with a large 
dictionary augmented with a system (such as a transducer) for handling 
productive morphology, pronunciation changes, names, numbers, and 
acronyms. 


Bibliographical and Historical Notes 

The major insights of articulatory phonetics date to the linguists of 800-150 
B.C. India. They invented the concepts of place and manner of articulation, 
worked out the glottal mechanism of voicing, and understood the concept of 
assimilation. European science did not catch up with the Indian phoneticians 
until over 2000 years later, in the late 19th century. The Greeks did have 
some rudimentary phonetic knowledge; by the time of Plato’s Theaetetus and 
Cratylus , for example, distinguished vowels from consonants, and stop con¬ 
sonants from continuants. The Stoics developed the idea of the syllable and 
were aware of phonotactic constraints on possible words. An unknown Ice¬ 
landic scholar of the twelfth century exploited the concept of the phoneme, 
proposed a phonemic writing system for Icelandic, including diacritics for 
length and nasality. But his text remained unpublished until 1818 and even 
then was largely unknown outside Scandinavia (Robins, 1967). The modem 
era of phonetics is usually said to have begun with (1877), who proposed 
what is essentially the phoneme in his Handbook of Phonetics (1877). He 
also devised an alphabet for transcription and distinguished between broad 
and narrow transcription, proposing many ideas that were eventually incor¬ 
porated into the IPA. Sweet was considered the best practicing phonetician 
of his time; he made the first scientific recordings of languages for phonetic 
purposes, and advanced the start of the art of articulatory description. He 
was also infamously difficult to get along with, a trait that is well captured 
in the stage character that George Bernard Shaw modeled after him: Henry 
Higgins. The phoneme was first named by the Polish scholar Baudouin de 
Courtenay, who published his theories in 1894. 
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The idea that phonological rules could be modeled as regular rela¬ 
tions dates to Johnson (1972), who showed that any phonological system 
that didn’t allow rules to apply to their own output (i.e. systems that did not 
have recursive rules) could be modeled with regular relations (or finite-state 
transducers). Virtually all phonological rules that had been formulated at 
the time had this property (except some rules with integral-valued features, 
like early stress and tone rules). Johnson’s insight unfortunately did not at¬ 
tract the attention of the community, and was independently discovered by 
Roland Kaplan and Martin Kay; see Chapter 3 for the rest of the history of 
two-level morphology. Karttunen (1993) gives a tutorial introduction to two- 
level morphology which includes more of the advanced details than we were 
able to present here. 

Readers interested in phonology should consult (Goldsmith, 1995) as a 
reference on phonological theory in general and Archangeli and Langendoen 
(1997) on Optimality Theory. 

Two classic text-to-speech synthesis systems are described in Allen 
et al. (1987) (the MITalk system) and Sproat (1998b) (the Bell Labs sys¬ 
tem). The pronunciation problem in text-to-speech synthesis is an ongoing 
research area; much of the current research focuses on prosody. Interested 
readers should consult the proceedings of the main speech engineering con¬ 
ferences: ICSLP (the International Conference on Spoken Language Pro¬ 
cessing), IEEE ICASSP (the International Conference on Acoustics, Speech, 
and Signal Processing), and EUROSPEECH. 

Students with further interest in transcription and articulatory phonet¬ 
ics should consult an introductory phonetics textbook such as Ladefoged 
(1993). Pullum and Ladusaw (1996) is a comprehensive guide to each of the 
symbols and diacritics of the IPA. Many phonetics papers of computational 
interest are to be found in the Journal of the Acoustical Society of America 
(JASAf Computer Speech and Language , and and Speech Communication. 


Exercises 

4.1 Find the mistakes in the IPA transcriptions of the following words: 
a. “three” [Sri] 
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b. “sing” [sing] 

c. “eyes” [ais] 

d. “study” [studi] 

e. “though” [0ou] 

f. “planning” [plunirj] 

g. “slight” [slit] 

4.2 Translate the pronunciations of the following color words from the IPA 
into the ARPAbet (and make a note if you think you pronounce them differ¬ 
ently than this!) 

a. [red] 

b. [blu] 

c. [grin] 

d. ['jelou] 

e. [black] 

f. [wait] 

g. ['orind 3 ] 

h. ['pypl] 

i. [pjus] 

j. [toup] 

4.3 Transcribe Ira Gershwin’s two pronunciations of ‘either’ in IPA and in 
the ARPAbet. 

4.4 Transcribe the following words in both the ARPAbet and the IPA. 

a. dark 

b. suit 

c. greasy 

d. wash 

e. water 

4.5 Write an FST which correctly pronounces strings of dollar amounts 
like $45, $320, and $4100. If there are multiple ways to pronounce a number 
you may pick your favorite way. 

4.6 Write an FST which correctly pronounces 7-digit phone numbers like 
555-1212, 555-1300, and so on. You should use a combination of the paired 
and trailing unit methods of pronunciation for the last four digits. 



138 


Chapter 4. Computational Phonology and Text-to-Speech 


CANADIAN 

RAISING 


4.7 Build an automaton for rule (4.5). 

4.8 One difference between one dialect of Canadian English and most di¬ 
alects of American English is called Canadian raising. (Bromberger and 
Halle, 1989) note that some Canadian dialects of English raise /ai/ to [ai] 
and /a u/ to [aa] before a voiceless consonant. A simplified version of the 
rule dealing only with /ai/ can be stated as: 


/ai/ —> [ai] /_ 


C 

—voice 


(4.12) 


This rule has an interesting interaction with the flapping rule. In some 
Canadian dialects the word rider and writer are pronounced differently: rider 
is pronounced [rairaf 1 ] while writer is pronounced [rAira^]. Write a two-level 
rule and an automaton for both the raising rule and the flapping rule which 
correctly models this distinction. You may make simplifying assumptions as 
needed. 


4.9 Write the lexical entry for the pronunciation of the English past tense 
(preterite) suffix -d, and the two level-rules that express the difference in its 
pronunciation depending on the previous context. Don’t worry about the 
spelling rules. (Hint: make sure you correctly handle the pronunciation of 
the past tenses of the words add , pat , bake , and bag.) 

4.10 Write two-level rules for the Yawelmani Yokuts phenomena of Har¬ 
mony, Shortening, and Lowering introduced on page 110. Make sure your 
rules are capable of running in parallel. 

4.11 Find 10 stress-neutral name suffixes (look in a phone book) and sketch 
an FST which would model the pronunciation of names with or without suf¬ 
fixes. 




PROBABILISTIC MODELS 
OF PRONUNCIATION 
AND SPELLING 


Algernon; But my own sweet Cecily, I have never written you 
any letters. 

Cecily; You need hardly remind me of that, Ernest. I remember 
only too well that I was forced to write your letters for you. I 
wrote always three times a week, and sometimes oftener. 
Algernon; Oh, do let me read them, Cecily? 

Cecily; Oh, I couldn't possibly. They would make you far too 
conceited. The three you wrote me after I had broken off the en¬ 
gagement are so beautiful, and so badly spelled, that even now I 
can hardly read them without crying a little. 

Oscar Wilde, The Importance of being Ernest 


Like Oscar Wilde’s Cecily, the characters in Gilbert and Sullivan’s op¬ 
erettas also seem somewhat anxious about spelling. The Gondoliers' Giu¬ 
seppe worries that his private secretary is ‘shaky in his spelling’ while Iolan- 
the’s Phyllis can ‘spell every word that she uses’. While an investigation into 
the role of proper spelling in class identification at the tum-of-the-century 
would take us too far afield (although see Yeblen (1889)), we can certainly 
agree that many more of us are like Cecily than like Phyllis. Estimates for 
the frequency of spelling errors in human typed text vary from 0.05% of the 
words in carefully edited newswire text to 38% in difficult applications like 
telephone directory lookup (Kukich, 1992). 

In this chapter we discuss the problem of detecting and correcting 
spelling errors and the very related problem of modeling pronunciation vari¬ 
ation for automatic speech recognition and text-to-speech systems. On the 
surface, the problems of finding spelling errors in text and modeling the vari- 
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able pronunciation of words in spoken language don’t seem to have much 
in common. But the problems turn out to be isomorphic in an important 
way: they can both be viewed as problems of probabilistic transduction. For 
speech recognition, given a string of symbols representing the pronunciation 
of a word in context, we need to figure out the string of symbols represent¬ 
ing the lexical or dictionary pronunciation, so we can look the word up in the 
dictionary. But any given surface pronunciation is ambiguous; it might corre¬ 
spond to different possible words. For example the ARPAbet pronunciation 
[er] could correspond to reduced forms of the words her , were, are , their , 
or your. This ambiguity problem is heightened by pronunciation varia¬ 
tion; for example the word the is sometimes pronounced THEE and some¬ 
times THUH; the word because sometimes appears as because , sometimes 
as ’cause. Some aspects of this variation are systematic; Section 5.7 will sur¬ 
vey the important kinds of variation in pronunciation that are important for 
speech recognition and text-to-speech, and present some preliminary rules 
describing this variation. High-quality speech synthesis algorithms need to 
know when to use particular pronunciation variants. Solving both speech 
tasks requires extending the transduction between surface phones and lexi¬ 
cal phones discussed in Chapter 4 with probabilistic variation. 

Similarly, given the sequence of letters corresponding to a mis-spelled 
word, we need to produce an ordered list of possible correct words. For 
example the sequence acress might be a mis-spelling of actress , or of cress , 
or of acres. We transduce from the ‘surface’ form acress to the various 
possible ‘lexical’ forms, assigning each with a probability; we then select 
the most probable correct word. 

In this chapter we first introduce the problems of detecting and correct¬ 
ing spelling errors, and also summarize typical human spelling error patterns. 
We then introduce the essential probabilistic architecture that we will use to 
solve both spelling and pronunciation problems: the Bayes Rule and the 
noisy channel model. The Bayes rule and its application to the noisy chan¬ 
nel model will play a role in many problems throughout the book, particu¬ 
larly in speech recognition (Chapter 7), part-of-speech tagging (Chapter 8), 
and probabilistic parsing (Chapter 12). 

The Bayes Rule and the noisy channel model provide the probabilistic 
framework for these problems. But actually solving them requires an algo¬ 
rithm. This chapter introduces an essential algorithm called the dynamic 
programming algorithm, and various instantiations including the Viterbi 
algorithm, the minimum edit distance algorithm, and the forward algo¬ 
rithm. We will also see the use of a probabilistic version of the finite-state 
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automaton called the weighted automaton. 


5.1 Dealing with Spelling Errors 

The detection and correction of spelling errors is an integral part of modem 
word-processors. The very same algorithms are also important in applica¬ 
tions in which even the individual letters aren’t guaranteed to be accurately 
identified: optical character recognition (OCR) and on-line handwriting ocr 
recognition. Optical character recognition is the term used for automatic 
recognition of machine or hand-printed characters. An optical scanner con¬ 
verts a machine or hand-printed page into a bitmap which is then passed to 
an OCR algorithm. 

On-line handwriting recognition is the recognition of human printed 
or cursive handwriting as the user is writing. Unlike OCR analysis of hand¬ 
writing, algorithms for on-line handwriting recognition can take advantage 
of dynamic information about the input such as the number and order of 
the strokes, and the speed and direction of each stroke. On-line handwrit¬ 
ing recognition is important where keyboards are inappropriate, such as in 
small computing environments (palm-pilot applications, etc) or in scripts 
like Chinese that have large numbers of written symbols, making keyboards 
cumbersome. 

In this chapter we will focus on detection and correction of spelling 
errors, mainly in typed text, but the algorithms will apply also to OCR and 
handwriting applications. OCR systems have even higher error rates than 
human typists, although they tend to make different errors than typists. For 
example OCR systems often misread ‘D’ as ‘O’ or ‘ri’ as ‘n’, producing 
‘mis-spelled’ words like derision for derision , or POQ Bach for PDQ Bach. 

The reader with further interest in handwriting recognition should consult 
sources such as Tappert et al (1990), Hu et al (1996), and Casey and Leco- 
linet (1996). 

Kukich (1992), in her survey article on spelling correction, breaks the 
field down into three increasingly broader problems: 

1. non-word error detection: detecting spelling errors which result in 
non-words (like graffe for giraffe ). 

2. isolated-word error correction: correcting spelling errors which re¬ 
sult in non-words, for example correcting graffe to giraffe , but looking 
only at the word in isolation. 
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3. context-dependent error detection and correction: Using the con¬ 
text to help detect and correct spelling errors even if they acciden¬ 
tally result in an actual word of English (real-word errors). This 
can happen from typographical errors (insertion, deletion, transposi¬ 
tion) which accidently produce a real word (e.g. there for three), or 
because the writer substituted the wrong spelling of a homophone or 
near-homophone (e.g. dessert for desert , or piece for peace). 

The next section will discuss the kinds of spelling-error patterns that 
occur in typed text and OCR and handwriting-recognition input. 


5.2 S pelling Error Patterns 

The number and nature of spelling errors in human typed text differs from 
those caused by pattern-recognition devices like OCR and handwriting rec¬ 
ognizers. Grudin (1983) found spelling error rates of between 1% and 3% 
in human typewritten text (this includes both non-word errors and real-word 
errors). This error rate goes down significantly for copy-edited text. The 
rate of spelling errors in handwritten text itself is similar; word error rates of 
between 1.5% and 2.5% have been reported (Kukich, 1992). 

The errors of OCR and on-line hand-writing systems vary. Yaeger et al. 
(1998) propose, based on studies that they warn are inconclusive, that the 
online printed character recognition on Apple Computer’s newton mes- 
S agepad has a word accuracy rate of 97%-98%, i.e. an error rate of 2%-3%, 
but with a high variance (depending on the training of the writer, etc). OCR 
error rates also vary widely depending on the quality of the input; (Lopresti 
and Zhou, 1997) suggest that OCR letter-error rates typically range from 
0.2% for clean, first-generation copy to 20% or worse for multigeneration 
photocopies and faxes. 

In an early study, Damerau (1964) found that 80% of all misspelled 
words (non-word errors) in a sample of human keypunched text were caused 
by single-error misspellings: a single one of the following errors: 1 

insertion • insertion: mistyping the as ther 

deletion • deletion: mistyping the as th 

tion stitu ' • substitution: mistyping the as thw 


1 In another corpus, Peterson (1986) found that single-error misspellings accounted for an 
even higher percentage of all misspelled words (93%-95%). The difference between the 80% 
and the higher figure may be due to the fact that Damerau’s text included errors caused in 
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• transposition: mistyping the as hte 

Because of this study, much following research has focused on the 
correction of single-error misspellings. Indeed, the first algorithm we will 
present later in this chapter relies on the large proportion of single-error mis¬ 
spellings. 

Kukich (1992) breaks down human typing errors into two classes. Ty¬ 
pographic errors (for example misspelling spell as speel ), are generally 
related to the keyboard. Cognitive errors (for example misspelling sepa¬ 
rate as seperate) are caused by writers who not not know how to spell the 
word. Grudin (1983) found that the keyboard was the strongest influence on 
the errors produced; typographic errors constituted the majority of all error 
types. For example consider substitution errors, which were the most com¬ 
mon error type for novice typists, and the second most common error type 
for expert typists. Grudin found that immediately adjacent keys in the same 
row accounted for 59% of the novice substitutions and 31% of the error sub¬ 
stitutions (e.g. smsll for small). Adding in errors in the same column and 
homologous errors (hitting the corresponding key on the opposite side of 
the keyboard with the other hand), a total of 83% of the novice substitutions 
and 51% of the expert substitutions could be considered keyboard-based er¬ 
rors. Cognitive errors included phonetic errors (substituting a phonetically 
equivalent sequence of letters (seperate for separate) and homonym errors 
(substituting piece for peace). Homonym errors will be discussed in Chap¬ 
ter 7 when we discuss real-word error correction. 

While typing errors are usually characterized as substitutions, inser¬ 
tions, deletions, or transpositions, OCR errors are usually grouped into five 
classes: substitutions, multisubstitutions, space deletions or insertions, and 
failures. Lopresti and Zhou (1997) give the following example of common 
OCR errors: 

Correct: 

The quick brown fox jumps over the lazy dog. 

Recognized: 

The q~ ick brown foxjumps over tb 1 azy dog. 

Substitutions (e —> c) are generally caused by visual similarity (rather 
than keyboard distance), as are multisubstitutions (T —> 7, m —>> rn , he 
b). Multisubstitutions are also often called framing errors. Failures (repre- 

transcription to punched card forms, errors in keypunching, and errors caused by paper tape 
equipment (!) in addition to purely human misspellings. 


TRANSPOSI¬ 

TION 
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sented by the tilde character ‘~ ’: u —> are cases where the OCR algorithm 

does not select any letter with sufficient accuracy. 


5.3 Detecting Non-Word Errors 

Detecting non-word errors in text, whether typed by humans or scanned, is 
most commonly done by the use of a dictionary. For example, the word 
foxjurnps in the OCR example above would not occur in a dictionary. Some 
early research (Peterson, 1986) had suggested that such spelling dictionar¬ 
ies would need to be kept small, because large dictionaries contain very rare 
words that resemble misspellings of other words. For example wont is a 
legitimate but rare word but is a common misspelling of won't. Similarly, 
veery (a kind of thrush) might also be a misspelling of very. Based on a sim¬ 
ple model of single-error misspellings, Peterson showed that it was possible 
that 10% of such misspellings might be ‘hidden’ by real words in a 50,000 
word dictionary, but that 15% of single-error misspellings might be ‘hidden’ 
in a 350,000 word dictionary. In practice, Damerau and Mays (1989) found 
that this was not the case; while some misspellings were hidden by real 
words in a larger dictionary, in practice the larger dictionary proved more 
help than harm. 

Because of the need to represent productive inflection (the -s and ed 
suffixes) and derivation, dictionaries for spelling error detection usually in¬ 
clude models of morphology, just as the dictionaries for text-to-speech we 
saw in Chapter 3 and Chapter 4. Early spelling error detectors simply al¬ 
lowed any word to have any suffix - thus Unix spell accepts bizarre pre¬ 
fixed words like misclam and antiundoggingly and suffixed words based on 
the like thehood and theness. Modem spelling error detectors use more 
linguistically-motivated morphological representations (see Chapter 3). 


5.4 Probabilistic Models 

This section introduces probabilistic models of pronunciation and spelling 
variation. These models, particularly the Bayesian inference or noisy chan¬ 
nel model, will be applied throughout this book to many different problems. 

We claimed earlier that the problem of ASR pronunciation modeling, 
and the problem of spelling correction for typing or for OCR, can be modeled 
as problems of mapping from one string of symbols to another. For speech 
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recognition, given a string of symbols representing the pronunciation of a 
word in context, we need to figure out the string of symbols representing 
the lexical or dictionary pronunciation, so we can look the word up in the 
dictionary. Similarly, given the incorrect sequence of letters in a mis-spelled 
word, we need to figure out the correct sequence of letters in the correctly- 
spelled word. 



The intuition of the noisy channel model (see Figure 5.1) is to treat channel 
the surface form (the ‘reduced’ pronunciation or misspelled word) as an in¬ 
stance of the lexical form (the ‘lexical’ pronunciation or correctly-spelled 
word) which has been passed through a noisy communication channel. This 
channel introduces ‘noise’ which makes it hard to recognize the ‘true’ word. 

Our goal is then to build a model of the channel so that we can figure out how 
it modified this ‘true’ word and hence recover it. For the complete speech 
recognition tasks, there are many sources of ‘noise’; variation in pronun¬ 
ciation, variation in the realization of phones, acoustic variation due to the 
channel (microphones, telephone networks, etc). Since this chapter focuses 
on pronunciation, what we mean by ‘noise’ here is the variation in pronun¬ 
ciation that masks the lexical or ‘canonical’ pronunciation; the other sources 
of noise in a speech recognition system will be discussed in Chapter 7. For 
spelling error detection, what we mean by noise is the spelling errors which 
mask the correct spelling of the word. The metaphor of the noisy channel 
comes from the application of the model to speech recognition in the IBM 
labs in the 70’s (Jelinek, 1976). But the algorithm itself is a special case of 
Bayesian inference and as such has been known since the work of Bayes bayesian 
(1763). Bayesian inference or Bayesian classification was applied success¬ 
fully to language problems as early as the late 1950’s, including the OCR 
work of Bledsoe in 1959, and the seminal work of Mosteller and Wallace 
(1964) on applying Bayesian inference to determine the authorship of the 
Federalist papers. 

In Bayesian classification, as in any classification task, we are given 
some observation and our job is to determine which of a set of classes it 
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V 

w 

o 


belongs to. For speech recognition, imagine for the moment that the ob¬ 
servation is the string of phones which make up a word as we hear it. For 
spelling error detection, the observation might be the string of letters that 
constitute a possibly-misspelled word. In both cases, we want to classify 
the observations into words; thus in the speech case, no matter which of the 
many possible ways the word about is pronounced (see Chapter 4) we want 
to classify it as about. In the spelling case, no matter how the word separate 
is misspelled, we’d like to recognize it as separate. 

Let’s begin with the pronunciation example. We are given a string of 
phones (say [ni]). We want to know which word corresponds to this string 
of phones. The Bayesian interpretation of this task starts by considering all 
possible classes — in this case, all possible words. Out of this universe of 
words, we want to chose the word which is most probable given the ob¬ 
servation we have ([ni]). In other words, we want, out of all words in the 
vocabulary V the single word such that P(word|observation) is highest. We 
use w to mean ‘our estimate of the correct w’, and we’ll use O to mean ‘the 
observation sequence [ni]’ (we call it a sequence because we think of each 
letter as an individual observation). Then the equation for picking the best 
word given is: 


w = argmaxP(w|0) 

vvGV 


(5.1) 


The function argmax A f(x) means ‘the x such that f(x) is maximized’. 
While (5.1) is guaranteed to give us the optimal word w, it is not clear how 
to make the equation operational; that is, for a given word w and observation 
sequence O we don’t know how to directly compute P(w\0). The intuition of 
Bayesian classification is to use Bayes’ rule to transform (5.1) into a product 
of two probabilities, each of which turns out to be easier to compute than 
P(w\0). Bayes’ rule is presented in (5.2); it gives us a way to break down 
P{x\0) into three other probabilities: 

P{y\x)P{x) 


P(x\y) = 


p(y) 


We can see this by substituting (5.2) into (5.1) to get (5.3): 
P(0|w)P(w) 


w = argmax - 

wEV 


p(O) 


(5.2) 


(5.3) 


The probabilities on the right hand side of (5.3) are for the most part 
easier to compute than the probability P{w\0) which we were originally try¬ 
ing to maximize in (5.1). For example, P(w), the probability of the word 
itself, we can estimate by the frequency of the word. And we will see below 
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that P{0\w) turns out to be easy to estimate as well. But P(0 ), the probabil¬ 
ity of the observation sequence, turns out to be harder to estimate. Luckily, 
we can ignore P{0). Why? Since we are maximizing over all words, we will 
be computing for each word. But P(O) doesn’t change for each 

word; we are always asking about the most likely word string for the same 
observation 0, which must have the same probability P(O). Thus: 

w = argmax _ argmax/ 5 (0|>v) P(w) (5.4) 

wev P\0) wev 

To summarize, the most probable word w given some observation 0 
can be computing by taking the product of two probabilities for each word, 
and choosing the word for which this product is greatest. These two terms 
have names; P(w) is called the Prior probability, and P(0\w) is called the 

likelihood. 


PRIOR 

LIKELIHOOD 


Key Concept #3. 


likelihood prior 
w = argmax P{0\w) P(w) 

wEV 


(5.5) 


In the next sections we will show how to compute these two probabili¬ 
ties for the probabilities of pronunciation and spelling. 


5.5 Applying the Bayesian method to spelling 


There are many algorithms for spelling correction; we will focus on the 
Bayesian (or noisy channel) algorithm because of its generality. Chapter 6 
will show how this algorithm can be extended to model real-word spelling 
errors; this section will focus on non-word spelling errors. The noisy chan¬ 
nel approach to spelling correction was first suggested by Kemighan et al. 
(1990); their program, correct , takes words rejected by the Unix spell pro¬ 
gram, generates a list of potential correct words, rank them according to 
Equation (3), and picks the highest-ranked one. 

Let’s walk through the algorithm as it applies to Kemighan et aV s 
(1990) example misspelling acress. The algorithm has two stages: proposing 
candidate corrections and scoring the candidates. 

In order to propose candidate corrections Kemighan et al make the 
simplifying assumption that the correct word will differ from the misspelling 
by a single insertion, deletion, substitution, or transposition. As Damerau’s 
(1964) results show, even though this assumption causes the algorithm to 
miss some corrections, it should handle most spelling errors in human typed 



148 


Chapter 5. Probabilistic Models of Pronunciation and Spelling 


text. The list of candidate words is generated from the typo by applying any 
single transformation which results in a word in a large on-line dictionary. 
Applying all possible transformations to acress yields the list of candidate 
words in Figure 5.2. 





Transformation 



Correct 

Error 

Position 


Error 

Correction 

Letter 

Letter 

(Letter #) 

Type 

acress 

actress 

t 

- 

2 

deletion 

acress 

cress 

- 

a 

0 

insertion 

acress 

caress 

ca 

ac 

0 

transposition 

acress 

access 

c 

r 

2 

substitution 

acress 

across 

0 

e 

3 

substitution 

acress 

acres 

- 

2 

5 

insertion 

acress 

acres 

- 

2 

4 

insertion 

Figure 5.2 Candidate corrections for the misspelling acress , together with 
the transformations that would have produced the error, after Kemighan et al 

(1990). represents a null letter. 





The second stage of the algorithm scores each correction by Equa¬ 
tion 5.4. Let t represent the typo (the misspelled word), and let c range over 
the set C of candidate corrections. The most likely correction is then: 

likelihood prior 

c = argmax P(t\c) P(c) (5.6) 

c£C 

As in Equation 5.4 we have omitted the denominator in Equation 5.6 
since the typo t, and hence its probability P(t), is constant for all c. The 
prior probability of each correction P(c) can be estimated by counting how 
normalizing often the word c occurs in some corpus, and then normalizing these counts 
by the total count of all words. 2 So the probability of a particular correction 
word c is computed by dividing the count of c by the number N of words 
in the corpus. Zero counts can cause problems, and so we will add .5 to all 
the counts. This is called ‘smoothing’, and will be discussed in Chapter 6; 
note that in Equation 5.7 we can’t just divide by the total number of words 
N since we added .5 to the counts of all the words, so we add .5 for each of 

2 Normalizing means dividing by some total count so that the resulting probabilities fall 
legally between 0 and L 



Section 5.5. Applying the Bayesian method to spelling 


149 


the V words in the vocabulary). 
C(c) + 0.5 


P(c) = 


A + 0.5 V 


(5.7) 


Chapter 6 will talk more about the role of corpora in computing prior 
probabilities; for now let’s use the corpus of Kemighan et al (1990), which 
is the 1988 AP newswire corpus of 44 million words. Thus N is 44 million. 
Since in this corpus, the word actress occurs 1343 times, the word acres 
2879 times, and so on, the resulting prior probabilities are as follows: 


c 

freq(c) p(c) 

actress 

1343 

.0000315 

cress 

0 

.000000014 

caress 

4 

.0000001 

access 

2280 

.000058 

across 

8436 

.00019 

acres 

2879 

.000065 


Computing the likelihood term p(t\c) exactly is an unsolved (unsolve- 
able?) research problem; the exact probability that a word will be mistyped 
depends on who the typist was, how familiar they were with the keyboard 
they were using, whether one hand happened to be more tired than the other, 
etc. Luckily, while p(t\c) cannot be computed exactly, it can be estimated 
pretty well, because the most important factors predicting an insertion, dele¬ 
tion, transposition are simple local factors like the identity of the correct 
letter itself, how the letter was misspelled, and the surrounding context. For 
example, the letters m and n are often substituted for each other; this is partly 
a fact about their identity (these two letters are pronounced similarly and 
they are next to each other on the keyboard), and partly a fact about context 
(because they are pronounced similarly, they occur in similar contexts). 

One simple way to estimate these probabilities is the one that Kemighan 
et al (1990) used. They ignored most of the possible influences on the prob¬ 
ability of an error and just estimated e.g. p{acress\across) using the number 
of times that e was substituted for o in some large corpus of errors. This is 
represented by a confusion matrix, a square 26x26 table which represents 
the number of times one letter was incorrectly used instead of another. For 
example, the cell labeled [o,e] in a substitution confusion matrix would give 
the count of times that e was substituted for o. The cell labeled [t, s] in an 
insertion confusion matrix would give the count of times that t was inserted 
after s. A confusion matrix can be computed by hand-coding a collection 
of spelling errors with the correct spelling and then counting the number 


CONFUSION 

MATRIX 
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of times different errors occurred (this has been done by Grudin (1983)). 
Kemighan et al (1990) used four confusion matrices, one for each type of 
single-error: 

• del[x,y] contains the number of times in the training set that the char¬ 
acters xy in the correct word were typed as x. 

• ins[x,y] contains the number of times in the training set that the char¬ 
acter x in the correct word was typed as xy. 

• sub[x,y] the number of times that x was typed as y 

• trans[x,y] the number of times that xy was typed as yx. 


Note that they chose to condition their insertion and deletion proba¬ 
bilities on the previous character; they could also have chosen to condition 
on the following character. Using these matrices, they estimated p(t\c) as 
follows (where c p is the p th character of the word c)\ 


p(t\c) = ! 


del[cp_i,cp] 

cownt[c p -ic p ] 


, if deletion 


ins[c p _i,^] ;r insertion 
counted ’ 11 mseruon 

fSifegj’ if substitution 
SfeSJ-if transposition 


(5.8) 


Figure 5.3 shows the final probabilities for each of the potential correc¬ 
tions; the prior (from Equation 5.7) is multiplied by the likelihood (computed 
using Equation 5.8 and the confusion matrices). The final column shows the 
‘normalized percentage’. 


c 

freq(c) 

P(c) 

p(t|c) 

p(t|c)p(c) 

% 

actress 

1343 

.0000315 

.000117 

3.69 x 10 -9 

37% 

cress 

0 

.000000014 

.00000144 

2.02 x 10“ 14 

0% 

caress 

4 

.0000001 

.00000164 

1.64 x 10“ 13 

0% 

access 

2280 

.000058 

.000000209 

1.21 x 10 -11 

0% 

across 

8436 

.00019 

.0000093 

1.77 x 10“ 9 

18% 

acres 

2879 

.000065 

.0000321 

2.09 x 10“ 9 

21% 

acres 

2879 

.000065 

.0000342 

2.22 x 10“ 9 

23% 

Figure 5.3 Computation of the ranking for each candidate correction. Note 
that the highest ranked word is not actress but acres (the two lines at the bottom 
of the table), since acres can be generated in two ways. The del[ 7, ins/ /, sub[ 7, 
and transU confusion matrices are given in full in Kemighan et al. (1990). 
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This implementation of the Bayesian algorithm predicts acres as the 
correct word (at a total normalized percentage of 45%), and actress as the 
second most likely word. Unfortunately, the algorithm was wrong here: the 
writer’s intention becomes clear from the context: ... was called a “stellar 
and versatile acress whose combination of sass and glamour has defined 
her. .. ”. The surrounding words make it clear that actress and not acres was 
the intended word; Chapter 6 will show how to augment the computation of 
the prior probability to use the surrounding words. 

The algorithm as we have described it requires hand-annotated data to 
train the confusion matrices. An alternative approach used by Kemighan 
et al (1990) is to compute the matrices by iteratively using this very spelling 
error correction algorithm itself. The iterative algorithm first initializes the 
matrices with equal values; thus any character is equally likely to be deleted, 
equally likely to be substituted for any other character, etc. Next the spelling 
error correction algorithm is run on a set of spelling errors. Given the set 
of typos paired with their corrections, the confusion matrices can now be 
recomputed, the spelling algorithm run again, and so on. This clever method 
turns out to be an instance of the important EM algorithm (Dempster et al , 
1977) that we will discuss in Chapter 7 and Appendix D. Kemighan et al 
(1990)’s algorithm was evaluated by taking some spelling errors that had 
two potential corrections, and asking three human judges to pick the best 
correction. Their program agreed with the majority vote of the human judges 
87% of the time. 

5.6 Minimum Edit Distance 


The previous section showed that the Bayesian algorithm, as implemented 
with confusion matrices, was able to rank candidate corrections. But Kemighan 
et al (1990) relied on the simplifying assumption that each word had only a 
single spelling error. Suppose we wanted a more powerful algorithm which 
could handle the case of multiple errors? We could think of such an algo¬ 
rithm as a general solution to the problem of string distance. The ‘string distance 
distance’ is some metric of how alike two strings are to each other. The 
Bayesian method can be viewed as a way of applying such an algorithm to 
the spelling error correction problem; we pick the candidate word which is 
‘closest’ to the error in the sense of having the highest probability given the 
error. 

One of the most popular classes of algorithms for finding string dis- 
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MINIMUM EDIT 
DISTANCE 


ALIGNMENT 


tance are those that use some version of the minimum edit distance algo¬ 
rithm, named by Wagner and Fischer (1974) but independently discovered 
by many people; see the History section. The minimum edit distance be¬ 
tween two strings is the minimum number of editing operations (insertion, 
deletion, substitution) needed to transform one string into another. For ex¬ 
ample the gap between intention and execution is 5 operations, which can be 
represented in three ways; as a trace, an alignment, or a operation list as 
show in Figure 5.4. 



intentio 

n 
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executio 
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o n 

Figure 5.4 Three methods for representing differences between sequences 
(after Kruskal (1983)) 


We can also assign a particular cost or weight to each of these oper¬ 
ations. The Levenshtein distance between two sequences is the simplest 
weighting factor in which each of the three operations has a cost of 1 (Lev¬ 
enshtein, 1966). Thus the Levenshtein distance between intention and exe¬ 
cution is 5. Levenshtein also proposed an alternate version of his metric in 
which each insertion or deletion has a cost of one, and substitutions are not 
allowed (equivalent to allowing substitution, but giving each substitution a 
cost of 2, since any substitution can be represented by 1 insertion and 1 dele¬ 
tion). Using this version, the Levenshtein distance between intention and 
execution is 8. We can also weight operations by more complex functions, 
for example by using the confusion matrices discussed above to assign a 
probability to each operation. In this case instead of talking about the ‘mini¬ 
mum edit distance’ between two strings, we are talking about the ‘maximum 
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probability alignment’ of one string with another. If we do this, an aug¬ 
mented minimum edit distance algorithm which multiplies the probabilities 
of each transformation can be used to estimate the Bayesian likelihood of a 
multiple-error typo given a candidate correction. 

The minimum edit distance is computed by dynamic programming, prmiram- 
Dynamic programming is the name for a class of algorithms, first introduced 
by Bellman (1957), that apply a table-driven method to solve problems by 
combining solutions to subproblems. This class of algorithms includes the 
most commonly-used algorithms in speech and language processing, among 
them the minimum edit distance algorithm for spelling error correction the 
Viterbi algorithm and the forward algorithm which are used both in speech 
recognition and in machine translation, and the CYK and Earley algorithm 
used in parsing. We will introduce the minimum-edit-distance, Viterbi, and 
forward algorithms in this chapter and Chapter 7, the Earley algorithm in 
Chapter 10, and the CYK algorithm in Chapter 12. 

The intuition of a dynamic programming problem is that a large prob¬ 
lem can be solved by properly combining the solutions to various subprob¬ 
lems. For example, consider the sequence or ‘path’ of transformed words 
that comprise the minimum edit distance between the strings intention and 
execution. Imagine some string (perhaps it is exention ) that is in this opti¬ 
mal path (whatever it is). The intuition of dynamic programming is that if 
exention is in the optimal operation-list, then the optimal sequence must also 
include the optimal path from intention to exention. Why? If there were a 
shorter path from intention to exention then we could use it instead, resulting 
in a shorter overall path, and the optimal sequence wouldn’t be optimal, thus 
leading to a contradiction. 

Dynamic programming algorithms for sequence comparison work by 
creating a distance matrix with one column for each symbol in the target se¬ 
quence and one row for each symbol in the source sequence (i.e. target along 
the bottom, source along the side). For minimum edit distance, this matrix 
is the edit-distance matrix. Each cell edit-distance[ij] contains the distance 
between the first i characters of the target and the first j characters of the 
source. Each cell can be computed as a simple function of the surrounding 
cells; thus starting from the beginning of the matrix it is possible to fill in 
every entry. The value in each cell is computing by taking the minimum of 
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the three possible paths through the matrix which arrive there: 

{ distance [i —l J] + ins-cost(targetj) 

distance [/ — 1, j — 1] + subst-cost(sourcetarget f ) (5.9) 
distance [/, j — 1] + ins-co st {source j)) 

The algorithm itself is summarized in Figure 5.5, while Figure 5.6 
shows the results of applying the algorithm to the distance between inten¬ 
tion and execution assuming the version of Levenshtein distance in which 
insertions and deletions each have a cost of 1 and substitutions have a cost 
of 2. 


function MiN-EDiT-DiSTANCE(torge/, source) returns min-distance 

n«-LENGTH(tergeO 
m <- Length (so urce) 

Create a distance matrix distance [n+l,m+l] 
distance [0,0] 0 

for each column i from 0 to n do 
for each row j from 0 to m do 

distanced, j]<- Min( distance[i—lj] + ins-cost(targetj ), 

distance[i—lj— 1] + subst-cost(source j, target,*), 
distanced J— 1] + ins-cost(source j)) 


Figure 5.5 The minimum edit distance algorithm, an example of the class 
of dynamic programming algorithms. 


5.7 English Pronunciation Variation 


... when any of the fugitives of Ephraim said: ‘Let me go over,’ the 
men of Gilead said unto him: ‘Art thou an Ephraimite?’ If he said: 
‘Nay’; then said they unto him: ‘Say now Shibboleth’; and he said 
‘Sibboleth’; for he could not frame to pronounce it right; then they laid 
hold on him, and slew him at the fords of the Jordan; 

Judges 12:5-6 

This passage from Judges is a rather gory reminder of the political 
importance of pronunciation variation. Even in our (hopefully less politi¬ 
cal) computational applications of pronunciation, it is important to correctly 
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Figure 5.6 Computation of minimum edit distance between intention and 

execution via algorithm of Figure 5.5, using Levenshtein distance with cost of 

1 for insertions or deletions, 2 for substitutions. Substitution of a character for 
itself has a cost of 0. 


model how pronunciations can vary. We have already seen that a phoneme 
can be realized as different allophones in different phonetic environments. 

We have also shown how to write rules and transducers to model these 
changes for speech synthesis. Unfortunately, these models significantly sim¬ 
plified the nature of pronunciation variation. In particular, pronunciation 
variation is caused by many factors in addition to the phonetic environment. 

This section summarizes some of these kinds of variation; the following sec¬ 
tion will introduce the probabilistic tools for modeling it. 

Pronunciation variation is extremely widespread. Figure 5.7 shows 
the most common pronunciations of the words because and about from the 
hand-transcribed Switchboard corpus of American English telephone con¬ 
versations. Note the wide variation in pronunciation for these two words 
when spoken as part of a continuous stream of speech. 

What causes this variation? There are two broad classes of pronunci¬ 
ation variation: lexical variation and allophonic variation. We can think variation 
of lexical variation as a difference in what segments are used to represent var?ation nic 
the word in the lexicon, while allophonic variation is a difference in how the 
individual segments change their value in different contexts. In Figure 5.7, 
most of the variation in pronunciation is allophonic; i.e. due to the influ¬ 
ence of the surrounding sounds, syllable structure, etc. But the fact that the 
word because can be pronounced either as monosyllabic 'cause or bisyllabic 
because is probably a lexical fact, having to do perhaps with the level of 
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because 

about 

IPA 

ARPAbet 

% 

IPA 

ARPAbet 

% 

[bikAz] 

[b iy k ah z] 

27% 

[obau] 

[ax b aw] 

32% 

[bikxz] 

[b ix k ah z] 

14% 

[obaut] 

[ax b aw t] 

16% 

[kAz] 

[k ah z] 

7% 

[bau] 

[b aw] 

9% 

[kaz] 

[k ax z] 

5% 

[Abau] 

[ix b aw] 

8% 

[bikoz] 

[b ix k ax z] 

4% 

[ibaut] 

[ix b aw t] 

5% 

[bikAz] 

[b ih k ah z] 

3% 

[ibae] 

[ix b ae] 

4% 

[bokAz] 

[b ax k ah z] 

3% 

[obaer] 

[ax b ae dx] 

3% 

[kuz] 

[k uh z] 

2% 

[baur] 

[b aw dx] 

3% 

[ks] 

[ks] 

2% 

[bae] 

[b ae] 

3% 

[kiz] 

[k ix z] 

2% 

[baut] 

[b aw t] 

3% 

[kiz] 

[k ih z] 

2% 

[obaur] 

[ax b aw dx] 

3% 

[bikAg] 

[b iy k ah zh] 

2% 

[obae] 

[ax b ae] 

3% 

[bikAs] 

[b iy k ah s] 

2% 

M 

[b aa] 

3% 

[bikA] 

[b iy k ah] 

2% 

[baer] 

[b ae dx] 

3% 

[bikaz] 

fb iy k aa z] 

2% 

[ibaur] 

[ix b aw dx] 

2% 

[az] 

[ax z] 

2% 

[ibat] 

[ix b aa t] 

2% 

Figure 5.7 The 16 most common pronunciations of because and about 

from the hand-transcribed Switchboard corpus of American English 

conver- 

sational telephone speech (Godfrey et al , 1992; Greenberg et al , 1996) 


SOCIOLIN- 

GUISTIC 


DIALECT 

VARIATION 


informality of speech. 

An important source of lexical variation (although it can also affect al- 
lophonic variation) is sociolinguistic variation. Sociolinguistic variation is 
due to extralinguistic factors such as the social identity or background of the 
speaker. One kind of sociolinguistic variation is dialect variation. Speak¬ 
ers of some deep-southern dialects of American English use a monophthong 
or near-monophthong [a] or [ae] instead of a diphthong in some words with 
the vowel [ai]. In these dialects rice is pronounced [ra:s]. African-American 
Vernacular English (AAVE) has many of the same vowel differences from 
General American as does Southern American English, and also has indi¬ 
vidual words with specific pronunciations such as [bidms] for business and 
[aeks] for ask. For older speakers or those not from the American West or 
Midwest, the words caught and cot have different vowels ([kat] and [kat] 
respectively). Young American speakers or those from the West pronounce 
the two words cot and caught the same; the vowels [a] and [a] are usually 
not distinguished in these dialects. For some speakers from New York City 
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like the first author’s parents, the words Mary , ([meiri]) marry , ([maeri]) and 
merry ([meri]) are all pronounced differently, while other New York City 
speakers like the second author pronounce Mary , and merry identically, but 
differently than marry. Most American speakers pronounce all three of these 
words identically as ([meri]). Students who are interested in dialects of En¬ 
glish should consult Wells (1982), the most comprehensive study of dialects 
of English around the world. 

Other sociolinguistic differences are due to register or style rather than 
dialect. In a pronunciation difference that is due to style, the same speaker 
might pronounce the same word differently depending on who they were 
talking to or what the social situation is; this is probably the case when 
choosing between because and 'cause above. One of the most well-studied 
examples of style-variation is the suffix -ing (as in something ), which can be 
pronounced [irj] or /in/ (this is often written somethin'). Most speakers use 
both forms; as Labov (1966) shows, they use [irj] when they are being more 
formal, and [in] when more casual. In fact whether a speaker will use [irj] 
or [in] in a given situation varies markedly according to the social context, 
the gender of the speaker, the gender of the other speaker, etc. Wald and 
Shopen (1981) found that men are more likely to use the non-standard form 
[in] than women, that both men and women are more likely to use more of 
the standard form [irj] when the addressee is a women, and that men (but not 
women) tend to switch to [in] when they are talking with friends. 

Where lexical variation happens at the lexical level, allophonic varia¬ 
tion happens at the surface form and reflects phonetic and articulatory fac¬ 
tors. 3 For example, most of the variation in the word about in Figure 5.7 
was caused by changes in one of the two vowels or by changes to the final 
[t]. Some of this variation is due to the allophonic rules we have already 
discussed for the realization of the phoneme /t/. For example the pronun¬ 
ciation of about as [obaur]/[ax b aw dx]) has a flap at the end because the 
next word was the word it , which begins with a vowel; the sequence about 
it was pronounced [obauri]/[ax b aw dx ix]). Similarly note that final [t] is 
often deleted; (about as [bau]/[b aw]). Considering these cases as ‘deleted’ 
is actually a simplification; many of these ‘deleted’ cases of [t] are actually 
realized as a slight change to the vowel quality called glottalization which 
are not represented in these transcriptions. 


3 Many linguists distinguish between allophonic variation and what are called ‘optional 
phonological rules’; for the purposes of this textbook we will lump these both together as 
‘allophonic variation’. 


REGISTER 

STYLE 
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COARTICULA¬ 

TION 


ASSIMILATION 


PALATALIZA¬ 

TION 


DELETION 


When we discussed these rules earlier, we implied that they were de¬ 
terministic; given an environment, a rule always applies. This is by no means 
the case. Each of these allophonic rules is dependent on a complicated set of 
factors that must be interpreted probabilistically. In the rest of this section 
we summarize more of these rules and talk about the influencing factors. 
Many of these rules model coarticulation, which is a change in a segment 
due to the movement of the articulators in neighboring segments. Most al¬ 
lophonic rules relating English phoneme to their allophones can be grouped 
into a small number of types: assimilation, dissimilation, deletion, flapping, 
vowel reduction, and epenthesis. 

Assimilation is the change in a segment to make it more like a neigh¬ 
boring segment. The dentalization of [t] to ([t]) before the dental consonant 
[0] is an example of assimilation. Another common type of assimilation 
in English and cross-linguistically is palatalization. Palatalization occurs 
when the constriction for a segment occurs closer to the palate than it nor¬ 
mally would, because the following segment is palatal or alveolo-palatal. 
In the most common cases, /s/ becomes [J], /z/ becomes [g], N becomes [tj 
and Id/ becomes dg]. We saw one case of palatalization in Figure 5.7 in the 
pronunciation of because as [bikxg] (ARPAbet [b iy k ah zh]). Here the 
final segment of because , a lexical /z/, is realized as [g], because the fol¬ 
lowing word was you've. So the sequence because you've was pronounced 
[bikAguv]. A simple version of a palatalization rule might be expressed as 
follows; Figure 5.8 shows examples from the Switchboard corpus. 


M 

[z] 

M 

lid] 


=> < 


[J] 

[3] 

[tj] 

[d3] 




(5.10) 


Note in Figure 5.8 that whether a [t] is palatalized depends on lexical 
factors like word frequency ([t] is more likely to be palatalized in frequent 
words and phrases). 

Deletion is quite common in English speech. We saw examples of 
deletion of final Itl above, in the words about and it. /t/ and /d/ are often 
deleted before consonants, or when they are part of a sequence of two or 
three consonants; Figure 5.9 shows some examples. 

=^0/V _C (5.11) 

The many factors that influence the deletion of /t/ and /d/ have been 
extensively studied. For example /d/ is more likely to be deleted than /t/. 
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IPA 

IPA 

ARPAbet 

Phrase 

Lexical 

Reduced 

Reduced 

set your 

[setjar] 

[setj>] 

[s eh ch er] 

not yet 

[natjet] 

[natjet] 

[n aa ch eh t] 

last year 

[leestjir] 

[laestjir] 

[1 ae s ch iy r] 

what you 

[wAtju] 

[vratju] 

[w ax ch uw] 

this year 

[disjir] 

[3ijir] 

[dh ih sh iy r] 

because you’ve 

[bikAzjuv] 

[bikAguv] 

[b iy k ah zh uw v] 

did you 

[didju] 

[did3yA] 

[d ih jh y ah] 

Figure 5.8 Examples of palatalization from the Switchboard corpus; the 

lemma you (including your , you've, and you’d) was 

by far the most common 

cause of palatalization, followed by year(s) (especially in the phrases this year 

and last year). 





Phrase 

IPA 

Lexical 

IPA 

Reduced 

ARPAbet 

Reduced 

find him 

around this 
mind boggling 
most places 
draft the 

left me 

[faindhim] 

[oraunddis] 

[mairitxKjlnj] 

[moustpleisiz] 

[draeftdi] 

[lcftmi] 

[fainim] 

[iraums] 

[mainboglirj] 

[mouspleisiz] 

[draefdi] 

[lefmi] 

[f ay n ix m] 

[ix r aw n ih s] 

[m ay n b ao g el ih ng] 
[m ow s p 1 ey s ix z] 

[d r ae f dh iy] 

[1 eh f m iy] 

Figure 5.9 Examples of III and fdf deletion from Switchboard. Some of 
these examples may have glottalization instead of being completely deleted. 


Both are more likely to be deleted before a consonant (Labov, 1972). The 
final /tl and Id/ in the words and and just are particularly likely to be deleted 
(Labov, 1975; Neu, 1980). Wolfram (1969) found that deletion is more 
likely in faster or more casual speech, and that younger people and males 
are more likely to delete. Deletion is more likely when the two words sur¬ 
rounding the segment act as a sort of phrasal unit, either occurring together 
frequently (Bybee, 1996), having a high mutual information or trigram 
predictability (Gregory et al , 1999), or being tightly connected for other 
reasons (Zwicky, 1972). Fasold (1972), Labov (1972), and many others have 
shown that deletion is less likely if the word-final /t/ or /d/ is the past tense 
ending. For example in Switchboard, deletion is more likely in the word 
around (73% /d/-deletion) than in the word turned (30% /d/-deletion) even 
though the two words have similar frequencies. 
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HYPERARTIC- 

ULATES 


REDUCED 

VOWELS 

SCHWA 


The flapping rule is significantly more complicated than we suggested 
in Chapter 4, as a number of scholars have pointed out (see especially Rhodes 
(1992)). The preceding vowel is highly likely to be stressed, although this is 
not necessary (for example there is commonly a flap in the word thermome¬ 
ter [Oa^'mamiraf 1 ]). The following vowel is highly likely to be unstressed, al¬ 
though again this is not necessary, /t/ is much more likely to flap than 
/d/. There are complicated interactions with syllable, foot, and word bound¬ 
aries. Flapping is more likely to happen when the speaker is speaking more 
quickly, and is more likely to happen at the end of a word when it forms 
a collocation (high mutual information) with the following word (Gregory 
et ah, 1999). Flapping is less likely to happen when a speaker hyperar- 
ticulates, i.e. uses a particularly clear form of speech, which often happens 
when users are talking to computer speech recognition systems (Oviatt et ah , 
1998). There is a nasal flap [f] whose tongue movements resemble the oral 
flap but in which the velum is lowered. Finally, flapping doesn’t always hap¬ 
pen, even when the environment is appropriate; thus the flapping rule, or 
transducer, needs to be probabilistic, as we will see below. 

We have saved for last one of the most important phonological pro¬ 
cesses: vowel reduction, in which many vowels in unstressed syllables are 
realized as reduced vowels, the most common of which is schwa ([o]). 
Stressed syllables are those in which more air is pushed out of the lungs; 
stressed syllables are longer, louder, and usually higher in pitch than un¬ 
stressed syllables. Vowels in unstressed syllables in English often don’t have 
their full form; the articulatory gesture isn’t as complete as for a full vowel. 
As a result the shape of the mouth is somewhat neutral; the tongue is nei¬ 
ther particularly high nor particularly low. For example the second vowels 
in parakeet is schwa: [paerokit]. 

While schwa is the most common reduced vowel, it is not the only 
one, at least not in some dialects. Bolinger (1981) proposed three reduced 
vowels: a reduced mid vowel [o], a reduced front vowel [i], and a reduced 
rounded vowel [e]. But the majority of computational pronunciation lexi¬ 
cons or computational models of phonology systems limit themselves to one 
reduced vowel ([o]) (for example PRONLEX and CELEX) or at most two 
([o] =ARPABET [ax] and [i] = ARPAbet [ix]). Miller (1998) was able to 
train a neural net to automatically categorize a vowel as [o] or [i] based only 
on the phonetic context, which suggests that for speech recognition and text- 
to-speech purposes, one reduced vowel is probably adequate. Indeed Wells 
(1982) (167-168) notes that [o] and [i] are falling together in many dialects of 
English including General American and Irish, among others, a phenomenon 
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he calls weak vowel merger. 

A final note: not all unstressed vowels are reduced; any vowel, and 
diphthongs in particular can retain their full quality even in unstressed po¬ 
sition. For example the vowel [ei] (ARPAbet [ey]) can appear in stressed 
position as in the word eight ) [’eit] or unstressed position as in the word al¬ 
ways [d.weiz]. Whether a vowel is reduced depends on many factors. For 
example the word the can be pronounced with a full vowel 3i or reduced 
vowel 5o. It is more likely to be pronounced with the reduced vowel 3o in 
fast speech, in more casual situations, and when the following word begins 
with a consonant. It is more likely to be pronounced with the full vowel 8i 
when the following word begins with a vowel or when the speaker is having 
‘planning problems’; speakers are more likely to use a full vowel than a re¬ 
duced one if they don’t know what they are going to say next (Fox Tree and 
Clark, 1997). See Keating et al (1994) and Jurafsky et al (1998) for more 
details on factors effecting vowel reduction in the TIMIT and Switchboard 
corpora. Other factors influencing reduction include the frequency of the 
word, whether this is the final vowel in a phrase, and even the idiosyncracies 
of individual speakers. 


5.8 The Bayesian method for pronunciation 


HEAD KNIGHT OF NI: 
KNIGHTS OF NI: 
ARTHUR: 

HEAD KNIGHT: 
RANDOM: 

ARTHUR: 

HEAD KNIGHT: 
BEDEYERE: 

HEAD KNIGHT: 


Ni! 

Ni! Ni! Ni! Ni! Ni! 

Who are you? 

We are the Knights Who Say... ‘Ni’! 

Ni! 

No! Not the Knights Who Say ’Ni’! 

The same! 

Who are they? 

We are the keepers of the sacred words: ‘Ni’, ‘Peng’, 
and ‘Neee-wom’! 

Graham Chapman, John Cleese, Eric Idle, Terry Gilliam, Terry Jones, 
and Michael Palin, Monty Python and the Holy Grail 1975. 


The Bayesian algorithm that we used to pick the optimal correction for 
a spelling error can be used to solve what is often called the pronunciation 
subproblem in speech recognition. In this task, we are given a series of 
phones and our job is to compute the most probable word which generated 
them. For this chapter, we will simplify the problem in an important way 
by assuming the correct string of phones. A real speech recognizer relies on 
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probabilistic estimators for each phone, so it is never sure about the identity 
of any phone. We will relax this assumption in Chapter 7; for now, let’s look 
at the simpler problem. 

We’ll also begin with another simplification by assuming that we al¬ 
ready know where the word boundaries are. Later in the chapter, we’ll show 
that we can simultaneously find word boundaries (‘segment’) and model pro¬ 
nunciation variation. 

Consider the particular problem of interpreting the sequence of phones 
[ni], when it occurs after the word I at the beginning of a sentence. Stop and 
see if you can think of any words which are likely to have been pronounced 
[ni] before you read on. The word “Ni” is not allowed. 

You probably thought of the word knee. This word is in fact pro¬ 
nounced [ni]. But an investigation of the Switchboard corpus produces a 
total of 7 words which can be pronounced [ni]! The seven words are the, 
neat, need, new, knee, to, and you. 

How can the word the be pronounced [ni]? The explanation for this 
pronunciation (and all the others except the one for knee ) lies in the contextually- 
induced pronunciation variation we discussed in Chapter 4. For example, we 
saw that [t] and [d] were often deleted word finally, especially before coro¬ 
nals; thus the pronunciation of neat as [ni] happened before the word little 
(neat little —> [nilol]). The pronunciation of the as [ni] is caused by the re¬ 
gressive assimilation process also discussed in Chapter 4. Recall that in nasal 
assimilation, phones before or after nasals take on nasal manner of articula¬ 
tion. Thus [0] can be realized as [n]. The many cases of the pronounced 
as [ni] in Switchboard occurred after words like in, on, and been (so in the 
[inni]). The pronunciation of new as [ni] occurred most frequently in the 
word New York; the vowel [u] has fronted to [i] before a [y]. 

The pronunciation of to as [ni] occurred after the work talking (talking 
to you —V [takiniyu]); here the [u] is palatalized by the following [y] and the 
[n] is functioning jointly as the final sound of talking and the initial sound 
of to. Because this phone is part of two separate words we will not try to 
model this particular mapping; for the rest of this section let’s consider only 
the following five words as candidate lexical forms for [ni]: knee, the, neat, 
need, new. 

We saw in the previous section that the Bayesian spelling error cor¬ 
rection algorithm had two components: candidate generation, and candidate 
scoring. Speech recognizers often use an alternative architecture, trading 
off speech for storage. In this architecture, each pronunciation is expanded 
in advance with all possible variants, which are then pre-stored with their 
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scores. Thus there is no need for candidate generation; the word [ni] is 
simply stored with the list of words that can generate it. Let’s assume this 
method and see how the prior and likelihood are computed for each word. 

We will be choosing the word whose product of prior and likelihood is 
the highest, according to Equation 5.12, where y represents the sequence of 
phones (in this case [ni] and w represents the candidate word (the, new , etc)). 
The most likely word is then: 

likelihood prior 

w = argmax P(y|w) P(w) (5.12) 

We could choose to generate the likelihoods p(y\w) by using a set of 
confusion matrices as we did for spelling error correction. But it turns out 
that confusion matrices don’t do as well for pronunciation as for spelling. 
While misspelling tends to change the form of a word only slightly, the 
changes in pronunciation between a lexical and surface form are much greater. 
Confusion matrices only work well for single-errors, which, as we saw above, 
are common in misspelling. Furthermore, recall from Chapter 4 that pro¬ 
nunciation variation is strongly affected by the surrounding phones, lexical 
frequency, and stress and other prosodic factors. Thus probabilistic models 
of pronunciation variation include a lot more factors than a simple confusion 
matrix can include. 

One simple way to generate pronunciation likelihoods is via proba¬ 
bilistic rules. Probabilistic rules were first proposed for pronunciation by 
(Labov, 1969) (who called them variable rules). The idea is to take the 
rules of pronunciation variation we saw in Chapter 4 and associate them 
with probabilities. We can then run these probabilistic rules over the lexicon 
and generate different possible surface forms each with its own probability. 
For example, consider a simple version of a nasal assimilation rule which 
explains why the can be pronounced [ni]; a word-initial [8] becomes [n] if the 
preceding word ended in [n] or sometimes [m]: 

[.15] 8 n / [+nasal\ #_ (5.13) 

The [.15] to the left of the rule is the probability; this can be com¬ 
puted from a large-enough labeled corpus such as the transcribed portion of 
Switchboard. Let ncount be the number of times lexical [8] is realized word- 
ini tially by surface [n] when the previous word ends in a nasal (91 in the 
Switchboard corpus). Let envcount be the total number of times lexical [3] 
occurs (whatever its surface realization) when the previous word ends in a 
nasal (617 in the Switchboard corpus). The resulting probability is: 


PROBABILIS¬ 

TIC 

RULES 
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P (3 —> n / [+nasat\ #_) 


ncount 


envcount 

91 

617 

.15 


We can build similar probabilistic versions of the assimilation and dele¬ 
tion rules which account for the [ni] pronunciation of the other words. Fig¬ 
ure 5.10 shows sample rules and the probabilities trained on the Switchboard 
pronunciation database. 


Word 

Rule Name 

Rule 

P 

the 

nasal assimilation 

5 n / [+nasal\ #_ 

[-15] 

neat 

final t deletion 

t^%jv _# 

[•52] 

need 

final d deletion 

d^Q/V _# 

[-11] 

new 

u fronting 

w => ? /_# £y] 

[•36] 

Figure 5.10 Simple rules of pronunciation variation due to context in con¬ 
tinuous speech accounting for the pronunciation of each of these words as [ni]. 


We now need to compute the prior probability P(w) for each word. 
For spelling correction we did this by using the relative frequency of the 
word in a large corpus; a word which occurred 44,000 times in 44 million 
words receives the probability estimate 44 4 qqq Q qqq or .001. For the pronuncia¬ 
tion problem, let’s take our prior probabilities from a collection of a written 
and a spoken corpus. The Brown Corpus is a 1 million word collection 
of samples from 500 written texts from different genres (newspaper, nov¬ 
els, non-fiction, academic, etc.) which was assembled at Brown University 
in 1963-64 (Kucera and Francis, 1967; Francis, 1979; Francis and Kucera, 
1982). The Switchboard Treebank corpus is a 1.4 million word collection 
of telephone conversations. Together they let us sample from both the writ¬ 
ten and spoken genres. The table below shows the probabilities for our five 
words; each probability is computed from the raw frequencies by normaliz¬ 
ing by the number of words in the combined corpus (plus .5 * the number of 
word types; so the total denominator is 2,486,075 + 30,836): 
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w freq(w) p(w) 


knee 

61 

.000024 

the 

114,834 

.046 

neat 

338 

.00013 

need 

1417 

.00056 

new 

2625 

.001 


Now we are almost ready to answer our original question: what is 
the most likely word given the pronunciation [ni] and given that the previous 
word was / at the beginning of a sentence. Let’s start by multiplying together 
our estimates for p{w) and p{y\w) to get an estimate; we show them sorted 
from most probable to least probable (the has a probability of 0 since the 
previous phone was not [n], and hence there is no other rule allowing [5] to 
be realized as [n]): 


Word p(y|w) p(w) 

p(y|w)p(w) 

new 

.36 .001 

.00036 

neat 

.52 .00013 

.000068 

need 

.11 .00056 

.000062 

knee 

1.00 .000024 

.000024 

the 

0 .046 

0 


Our algorithm suggests that new is the most likely underlying word. 
But this is the wrong answer; the string [ni] following the word / came in 
fact from the word need in the Switchboard corpus. One way that people 
are able to correctly solve this task is word-level knowledge; people know 
that the word string I need ... is much more likely than the word string I new 
.... We don’t need to abandon our Bayesian model to handle this fact; we 
just need to modify it so that our model also knows that I need is more likely 
than I new. In Chapter 6 we will see that we can do this by using a slightly 
more intelligent estimate of p{w) called a bigram estimate; essentially we 
consider the probability of need following / instead of just the individual 
probability of need. 

This Bayesian algorithm is in fact part of all modem speech recog¬ 
nizers. Where the algorithms differ strongly is how they detect individual 
phones in the acoustic signal, and on which search algorithm they use to 
efficiently compute the Bayesian probabilities to find the proper string of 
words in connected speech (as we will see in Chapter 7). 
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Decision Tree Models of Pronunciation Variation 

In the previous section we saw how hand-written rules could be augmented 
with probabilities to model pronunciation variation. Riley (1991) and With- 
gott and Chen (1993) suggested an alternative to writing rules by hand, 
which has proved quite useful: automatically inducing lexical-to-surface 
tree sion pronunciations mappings from a labeled corpus with a decision tree, partic¬ 
ularly with the kind of decision tree called a Classification and Regression 
cart Tree (CART) (Breiman et al , 1984). A decision tree takes a situation de¬ 
scribed by a set of features and classifies it into a category and an associated 
probability. For pronunciation, a decision tree can be trained to take a lexical 
phone and various contextual features (surrounding phones, stress and sylla¬ 
ble structure information, perhaps lexical identity) and select an appropriate 
surface phone to realize it. We can think of the confusion matrices we used 
in spelling error correction above as degenerate decision trees; thus the sub¬ 
stitution matrix takes a lexical phone and outputs a probability distribution 
over potential surface phones to be substituted. The advantage of decision 
trees is that they can be automatically induced from a labeled corpus, and 
that they are concise: decision trees pick out only the relevant features and 
thus suffer less from sparseness than a matrix which has to condition on 
every neighboring phone. 
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For example, Figure 5.11 shows a decision tree for the pronunciation 
of the phoneme /t/ induced from the Switchboard corpus. While this tree 
doesn’t including flapping (there is a separate tree for flapping) it does model 
the fact that /t/ is more likely to be deleted before a consonant than before 
a vowel. Note, in fact, that the tree automatically induced the classes Vowel 
and Consonant Furthermore note that if /t/ is not deleted before a conso¬ 
nant, it is likely to be unreleased. Finally, notice that /t/ is very unlikely to 
be deleted in syllable onset position. 

Readers with interest in decision tree modeling of pronunciation should 
consult Riley (1991), Withgott and Chen (1993), and a textbook with an in¬ 
troduction to decision trees such as Russell and Norvig (1995). 

5.9 Weighted Automata 

We said earlier that for purposes of efficiency a lexicon is often stored with 
the most likely kinds of pronunciation variation pre-compiled. The two most 
common representation for such a lexicon are the trie and the weighted 
finite state automaton/transducer (or probabilistic FSA/FST) (Pereira 
et al , 1994). We will leave the discussion of the trie to Chapter 7, and 
concentrate here on the weighted automaton. 

The weighted automaton is a simple augmentation of the finite automa¬ 
ton in which each arc is associated with a probability, indicating how likely 
that path is to be taken. The probability on all the arcs leaving a node must 
sum to 1. Figure 5.12 shows two weighted automata for the word tomato , 
adapted from Russell and Norvig (1995). The top automaton shows two pos¬ 
sible pronunciations, representing the dialect difference in the second vowel. 
The bottom one shows more pronunciations (how many?) representing op¬ 
tional reduction or deletion of the first vowel and optional flapping of the 
final [t]. 

A Markov chain is a special case of a weighted automaton in which 
the input sequence uniquely determines which states the automaton will go 
through. Because they can’t represent ambiguous problems, a Markov chain 
is only useful for assigning problems to unambiguous sequences, and hence 
isn’t often used in speech or language processing. In fact the weighted au¬ 
tomata used in speech and language processing can be shown to be equiva¬ 
lent to Hidden Markov Models (HMMs). Why do we introduce weighted 
automata in this chapter and HMMs in Chapter 7? The two models offer 
a different metaphor; it is sometimes easier to think about certain problems 


WEIGHTED 


MARKOV 

CHAIN 
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Word model with dialect variation: 



Word model with coarticulation and dialect variation: 



Figure 5.12 You say [t ow m ey t ow] and I say [t ow m aa t ow]. Two 
pronunciation networks for the word tomato , adapted from Russell and Norvig 
(1995). The top one models sociolinguistic variation (some British or eastern 
American dialects); the bottom one adds in coarticulatory effects. Note the 
correlation between allophonic and sociolinguistic variation; the dialect with 
the vowel [ey] is more likely to flap than the other dialect. 


as weighted-automata than as HMMs. The weighted automaton metaphor is 
often applied when the input alphabet maps relatively neatly to the under¬ 
lying alphabet. For example, in the problem of correcting spelling errors in 
typewritten input, the input sequence consists of letters and the states of the 
automaton can correspond to letters. Thus it is natural to think of the problem 
as transducing from a set of symbols to the same set of symbols with some 
modifications, and hence weighted automata are naturally used for spelling 
error correction. In the problem of correcting errors in hand-written input, 
the input sequence is visual, and the input alphabet is an alphabet of lines and 
angles and curves. Here instead of transducing from an alphabet to itself, we 
need to do classification on some input sequence before considering it as 
a sequence of states. Hidden Markov Models provide a more appropriate 
metaphor, since they naturally handle separate alphabets for input sequences 
and state sequences. But since any probabilistic automaton in which the in¬ 
put sequence does not uniquely specify the state sequence can be modeled as 
an HMM, the difference is one of metaphor rather than explanatory power. 

Weighted automata can be created in many ways. One way, first pro¬ 
posed by Cohen (1989) is to start with on-line pronunciation dictionaries and 
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use hand-written rules of the kind we saw above to create different potential 
surface forms. The probabilities can then be assigned either by counting 
the number of times each pronunciation occurs in a corpus, or if the cor¬ 
pus is too sparse, by learning probabilities for each rule and multiplying 
out the rule probabilities for each surface form (Tajchman et al , 1995). Fi¬ 
nally these weighted rules, or alternatively the decision trees we discussed 
in the last section, can be automatically compiled into a weighted finite-state 
transducer (Sproat and Riley, 1996). Alternatively, for very common words, 
we can simply find enough examples of the pronunciation in a transcribed 
corpus to build the model by just combining all the pronunciations into a 
network (Wooters and Stolcke, 1994). 

The networks for tomato above were shown merely as illustration and 
are not from any real system; Figure 5.13 shows an automaton for the word 
about which is trained on actual pronunciations from the Switchboard corpus 
(we discussed these pronunciations in Chapter 4). 



Computing Likelihoods from Weighted Automata: The Forward Algorithm 

One advantage of an automaton-based lexicon is that there are efficient al¬ 
gorithms for generating the probabilities that are needed to implement the 
Bayesian method of correct-word-identification of Section 5.8. These algo¬ 
rithms apply to weighted automata and also to the Hidden Markov Models 
that we will discuss in Chapter 7. Recall that in our example the Bayesian 
method is given as input a series of phones [n iy], and must choose between 
the words the , neat , need , new , and knee . This was done by computing two 
probabilities: the prior probability of each word, and the likelihood of the 
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phone string [n iy] given each word. When we discussed this example ear¬ 
lier, we said that for example the likelihood of [n iy] given the word need was 
.11, since we computed a probability of. 11 for the final-d-deletion rule from 
our Switchboard corpus. This probability is transparent for need since there 
were only two possible pronunciations ([n iy] and [n iy d]). But for words 
like about , visualizing the different probabilities is more complex. Using a 
precompiled weighted automata can make it simpler to see all the different 
probabilities of different paths through the automaton. 

There is a very simple algorithm for computing the likelihood of a 
string of phones given the weighted automaton for a word. This algorithm, 
forward the forward algorithm, is an essential part of ASR systems, although in this 
chapter we will only be working with a simple usage of the algorithm. This is 
because the forward algorithm is particularly useful when there are multiple 
paths through an automaton which can account for the input; this is not the 
case in the weighted automata in this chapter, but will be true for the HMMS 
of Chapter 7. The forward algorithm is also an important step in defining the 
Viterbi algorithm which we will see later in this chapter. 

Let’s begin by giving a formal definition of a weighted automaton and 
of the input and output to the likelihood computation problem. A weighted 
automaton consists of 

1. a sequence of states q = {qoq\ ?2 ■ ■ • ?«)? each corresponding to a phone, 

2. a set of transition probabilities between states, «oi,^i 2 ,^i 3 ? encoding 
the probability of one phone following another. 

We represent the states as nodes, and the transition probabilities as 
edges between nodes; an edge exists between two nodes if there is a non-zero 
transition probability between the two nodes. 4 The sequences of symbols 
that are input to the model (if we are thinking of it as recognizer) or which are 
produced by the model (if we are thinking of it as a generator) are generally 
sequence on called the observation sequence, referred to as O = ( 01 O 2 O 3 (Upper¬ 

case letters are used for a sequence and lower-case letters for an individual 

4 We have used two ‘special’ states (often called non-emitting states) as the start and end 
state; it is also possible to avoid the use of these states. In that case, an automaton must 
specify two more things: 

1. n, an initial probability distribution over states, such that 7t ( - is the probability that the 
automaton will start in state i. Of course some states j may have 7 lj = 0, meaning that 
they cannot be initial states. 

2. a set of legal accepting states. 
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element of a sequence). We will use this terminology when talking about 
weighted automata and later when talking about HMMs. 

Figure 5.14 shows an automaton for the word need with a sample ob¬ 
servation sequence. 


a 24 = .11 



Figure 5.14 A simple weighted automaton or Markov chain pronunciation 
network for the word need , showing the transition probabilities, and a sample 
observation sequence. The transition probabilities a xy between two states x 
and y are 1.0 unless otherwise specified. 


This task of determining which underlying word might have produced 
an observation sequence is called the decoding problem. Recall that in order decoding 
to find which of the candidate words was most probable given the observa¬ 
tion sequence [n iy], we need to compute the product P(0\w)P(w) for each 
candidate word (the, need, neat, knee, new), i.e. the likelihood of the ob¬ 
servation sequence O given the word w times the prior probability of the 
word. 

The forward algorithm can be run to perform this computation for each 
word; we give it an observation sequence and the pronunciation automaton 
for a word and it will return J P(0|iv)/ > (iv). Thus one way to solve the de¬ 
coding problem is to run the forward algorithm separately on each word and 
choose the word with the highest value. As we saw earlier, the Bayesian 
method produces the wrong result for pronunciation [n iy] as part of the 
word sequence I need (its first choice is the word new, and the second choice 
is neat; need is only the third choice). Since the forward algorithm is just 
a way of implementing the Bayesian approach, it will return the exact same 
rankings. (We will see in Chapter 6 how to augment the algorithm with bi¬ 
gram probabilities which will enable it to make use of the knowledge that 
the previous word was 7). 

The forward algorithm takes as input a pronunciation network for each 
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candidate word. Because the word the only has the pronunciation [n iy] after 
nasals, and since we are assuming the actual context of this word was after 
the word / (no nasal), we will skip that word and look only at new , neat , 
need , and knee. Note in Figure 5.15 that we have augmented each network 
with the probability of each word, computed from the frequency that we saw 
on page 165. 


.11 


f start j - j n )—a iy )—4 d )— 

^0 @^00^-0 

Word model for "need" 

Word model for "knee" 

{ start)- 4 n ) f end J 

.52 


Word model for "neat" 

Word model for "new" 


Figure 5.15 Pronunciation networks for the words need , neat , new , and 
knee. All networks are simplified from the actual pronunciations in the Switch¬ 
board corpus. Each network has been augmented by the unigram probability 
of the word (i.e. its normalized frequency from the Switchboard+Brown cor¬ 
pus). Word probabilities are not usually included as part of the pronunciation 
network for a word; they are added here to simplify the exposition of the for- 

ward algorithm. 



The forward algorithm is another dynamic programming algorithm, 
and can be thought of as a slight generalization of the minimum edit dis¬ 
tance algorithm. Like the minimum edit distance algorithm, it uses a table 
to store intermediate values as it builds up the probability of the observa¬ 
tion sequence. Unlike the minimum edit distance algorithm, the rows are 
labeled not just by states which always occur in linear order, but implicitly 
by a state-graph which has many ways of getting from one state to another. 
In the minimum edit distance algorithm, we filled in the matrix by just com¬ 
puting the value of each cell from the 3 cells around it. With the forward 
algorithm, on the other hand, a state might be entered by any other state, 
and so the recurrence relation is somewhat more complicated. Furthermore, 
the forward algorithm computes the sum of the probabilities of all possible 
paths which could generate the observation sequence, where the minimum 
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edit distance computed the minimum such probability. 5 Each cell of the for¬ 
ward algorithm matrix, forward[tJ\ represents the probability of being in 
state j after seeing the first t observations, given the automaton X. Since 
we have augmented our graphs with the word probability p{w ), our exam¬ 
ple of the forward algorithm here is actually computing this likelihood times 
p(w). The value of each cell forward[t , j] is computed by summing over the 
probabilities of every path that could lead us to this cell. Formally, each cell 
expresses the following probability: 


forward [?, j] = P(oi , o 2 ... o t ,q, = j |A,) P(w) (5.14) 

Here q t = j means The probability that the fth state in the sequence 
of states is state /. We compute this probability by summing over the ex¬ 
tensions of all the paths that lead to the current cell. An extension of a path 
from a state i at time t — 1 is computed by multiplying the following three 
factors: 

1. the previous path probability from the previous cell forward [/ — 1, /]. 

2. the transition probability aij from previous state i to current state j. 

3. the observation likelihood h that current state j matches observation 
symbol t. For the weighted automata that we consider here, bj t is 1 if 
the observation symbol matches the state, and 0 otherwise. Chapter 7 
will consider more complex observation likelihoods. 


The algorithm is described in Figure 5.16. 

Figure 5.17 shows the forward algorithm applied to the word need. The 
algorithm applies similarly to the other words which can produce the string 
[n iy], resulting in the probabilities on page 165. In order to compute the 
most probable underlying word, we run the forward algorithm separately on 
each of the candidate words, and choose the one with the highest probabil¬ 
ity. Chapter 7 will give further details of the mathematics of the forward 
algorithm and introduce the related forward-backward algorithm. 


5 The forward algorithm computes the sum because there may be multiple paths through 
the network which explain a given observation sequence. Chapter 7 will take up this point in 
more detail. 
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function Forward {observations,state-graph) returns forward-probability 

num-states •<— num-of-s tates (state -graph) 
num-obs <- length(ob sensations) 

Create probability matrix forward[num-states + 2, num-obs + 2] 
forward[0,0] <- 1.0 

for each time step t from 0 to num-obs do 
for each state s from 0 to num-states do 

for each transition s f from s specified by state-graph 
forward[s' 7+1] «— forward[sJ] * a[s , s'] * b[s f , o t ] 
return the sum of the probabilities in the final column of forward 


Figure 5.16 The forward algorithm for computing likelihood of observa¬ 
tion sequence given a word model. a[s,s'] is the transition probability from 
current state s to next state s' and b[s',o t \ is the observation likelihood of s’ 
given o t . For the weighted automata that we consider here, b[s',o t \ is 1 if the 
observation symbol matches the state, and 0 otherwise. 


end 

d 

need iy 

n 

start 


Figure 5.17 The forward algorithm applied to the word need , computing 
the probability P(0\w)P(w). While this example doesn’t require the full power 
of the forward algorithm, we will see its use on more complex examples in 
Chapter 7. 
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Decoding: The Viterbi Algorithm 

The forward algorithm as we presented it seems a bit of an overkill. Since 
only one path through the pronunciation networks will match the input string, 
why use such a big matrix and consider so many possible paths? Further¬ 
more, as a decoding method, it seems rather inefficient to run the forward 
algorithm once for each word (imagine how inefficient this would be if we 
were computing likelihoods for all possible sentences rather than all possible 
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words!) Part of the reason that the forward algorithm seems like overkill is 
that we have immensely simplified the pronunciation problem by assuming 
that our input consists of sequences of unambiguous symbols. We will see in 
Chapter 7 that when the observation sequence is a set of noisy acoustic val¬ 
ues, there are many possibly paths through the automaton, and the forward 
algorithm will play an important role in summing these paths. 

But it is true that having to run it separately on each word makes the 
forward algorithm a very inefficient decoding method. Luckily, there is a 
simple variation on the forward algorithm called the Viterbi algorithm which 
allows us to consider all the words simultaneously and still compute the most 
likely path. The term Viterbi is common in speech and language process¬ 
ing, but like the forward algorithm this is really a standard application of 
the classic dynamic programming algorithm, and again looks a lot like the 
minimum edit distance algorithm. The Viterbi algorithm was first applied 
to speech recognition by Vintsyuk (1968), but has what Kruskal (1983) calls 
a ‘remarkable history of multiple independent discovery and publication’; 
see the History section at the end of the chapter for more details. The name 
Viterbi is the one which is most commonly used in speech recognition, al¬ 
though the terms DP alignment (for Dynamic Programming alignment), 
dynamic time warping and one-pass decoding are also commonly used. 
The term is applied to the decoding algorithm for weighted automata and 
Hidden Markov Models on a single word and also to its more complex ap¬ 
plication to continuous speech, as we will see in Chapter 7. In this chapter 
we will show how the algorithm is used to find the best path through a net¬ 
work composed of single words, as a result choosing the word which is most 
probable given the observation sequence string of words. 

The version of the Viterbi algorithm that we will present takes as input 
a single weighted automaton and a set of observed phones o = { 01 O 2 O 3 ...o t ) 
and returns the most probable state sequence q = (#i# 2 # 3 -together 
with its probability. We can create a single weighted automaton by combin¬ 
ing the pronunciation networks for the four words in parallel with a single 
start and a single end state. Figure 5.18 shows the combined network. 

Figure 5.19 shows pseudocode for the Viterbi algorithm. Like the min¬ 
imum edit distance and forward algorithm, the Viterbi algorithm sets up a 
probability matrix, with one column for each time index t and one row for 
each state in the state graph. Also like the forward algorithm, each column 
has a cell for each state qi in the single combined automaton for the four 
words. In fact, the code for the Viterbi algorithm should look exactly like 
the code for the forward algorithm with two modifications. First, where the 
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.000024 
knee 



.00013 


Figure 5.18 The pronunciation networks for the words need , neat, new, and 
knee combined into a single weighted automaton. Again, word probabilities 
are not usually considered part of the pronunciation network for a word; they 
are added here to simplify the exposition of the Viterbi algorithm. 


forward algorithm places the sum of all previous paths into the current cell, 
the Viterbi algorithm puts the max of the previous paths into the current cell. 

The algorithm first creates TV + 2 or four state columns. The first col¬ 
umn is an initial pseudo-observation, the second corresponds to the first 
observation phone [n], the third to [iy] and the fourth to a final pseudo¬ 
observation. We begin in the first column by setting the probability of the 
start state to 1.0, and the other probabilities to 0; the reader should find this 
in Figure 5.20. Cells with probability 0 are simply left blank for readability. 

Then we move to the next state; as with the forward algorithm, for 
every state in column 0, we compute the probability of moving into each 
state in column 1. The value viterbi[t , j] is computed by taking the maximum 
over the extensions of all the paths that lead to the current cell. An extension 
of a path from a state i at time t — 1 is computed by multiplying the same 
three factors we used for the forward algorithm: 

1. the previous path probability from the previous cell forward[t — 1, /]. 

2. the transition probability atj from previous state i to current state j. 

3. the observation likelihood bj t that current state j matches observation 
symbol t . For the weighted automata that we consider here, b J} is 1 if 
the observation symbol matches the state, and 0 otherwise. Chapter 7 
will consider more complex observation likelihoods. 
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function Viterbi {observations of len Testate-graph) returns best-path 

num-states num-of-s tates ( state-graph) 

Create a path probability matrix viterbi[num-states+2,T+2] 

viterbi [0,0] 1.0 

for each time step t from 0 to 7 do 

for each state s from 0 to num-states do 

for each transition s f from s specified by state-graph 
new-score <-viterbi[s, t] * a[s,s r ] * b s >(o t ) 
if ((viterbi[s f J-\-l] = 0) || ( new-score > viterbi[s f , /+7])) 
then 

viterbi[s f , t+l]<-new-score 
back-pointer^, t+l]<-s 

Backtrace from highest probability state in the final column of viterbi[] and 
return path 


Figure 5.19 Viterbi algorithm for finding optimal sequence of states in con¬ 
tinuous speech recognition, simplified by using phones as inputs. Given an 
observation sequence of phones and a weighted automaton (state graph), the 
algorithm returns the path through the automaton which has maximum proba¬ 
bility and accepts the observation sequence. a[s,s f ] is the transition probability 
from current state s to next state s f and b[s\o t \ is the observation likelihood of 
s’ given o t . For the weighted automata that we consider here, b[s\o t \ is 1 if 
the observation symbol matches the state, and 0 otherwise. 


In Figure 5.20, in the column for the input n , each word starts with [n], 
and so each has a non-zero probability in the cell for the state n. Other cells 
in that column have zero entries, since their states don’t match n. When we 
proceed to the next column, each cell that matches iy gets updated with the 
contents of the previous cell times the transition probability to that cell. Thus 
the value of viterbi[2,\y new \ for the iy state of the word new is the product of 
the ‘word’ probability of new times the probability of new being pronounced 
with the vowel iy. Notice that if we look only at this iy column, that the word 
need is currently the ‘most-probable’ word. But when we move to the final 
column, the word new will win out, since need has a smaller transition prob¬ 
ability to end (.11) than new does (1.0). We can now follow the backpointers 
and backtrace to find the path that gave us this final probability of .00036. 
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Figure 5.20 The entries in the individual state columns for the Yiterbi al¬ 
gorithm. Each cell keeps the probability of the best path so far and a pointer 
to the previous cell along that path. Backtracing from the end state, we can 
reconstruct the state sequence n ne w tynew, arriving at the best word new. 


Weighted Automata and Segmentation 

Weighted automata and the Viterbi algorithm play an important in various 
algorithm for segmentation. Segmentation is the process of taking an undif¬ 
ferentiated sequence of symbols and ‘segmenting’ it into chunks. For exam¬ 
ple sentence segmentation is the problem of automatically finding the sen¬ 
tence boundaries in a corpus. Similarly word segmentation is the problem 
of finding word-boundaries in a corpus. In written English there is no dif¬ 
ficulty in segmenting words from each other because there are orthographic 
spaces between words. This is not the case in languages like Chinese and 
Japanese that use a Chinese-derived writing system. Written Chinese does 
not mark word boundaries. Instead, each Chinese character is written one 
after the other without spaces. Since each character approximately repre- 
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sents a single morpheme, and since words can be composed of one or more 
characters, it is often difficult to know where words should be segmented. 
Proper word-segmentation is necessary for many applications, particularly 
including parsing and text-to-speech (how a sentence is broken up into words 
influences its pronunciation in a number of ways). 

Consider the following example sentence from Sproat et al (1996): 

(5.15) 05#ftS*«? 

“How do you say ‘octopus’ in Japanese?” 

This sentence has two potential segmentations, only one of which is 
correct. In the plausible segmentation, the first two characters are combined 
to make the word for ‘Japanese language’ (05 ri-wen) (the accents indicate 
the tone of each syllable), and the next two are combined to make the word 
for ‘octopus’ (*ft zhang-yu). 

(5.16) 0 5: itt ? 

ri-wen zhang-yu zen-me shuo 
Japanese octopus how say 

‘How do you say octopus in Japanese?’ 

(5.17) 0 5* ft itt ? 

ri wen-zhang yu zen-me shuo 

Japan essay fish how say 

‘How do you say Japan essay fish?’ 

Sproat et al (1996) give a very simple algorithm which selects the 
correct segmentation by choosing the one which contains the most-frequent 
words. In other words, the algorithm multiplies together the probabilities of 
each word in a potential segmentation and chooses whichever segmentation 
results in a higher product probability. 

The implementation of their algorithm combines a weighted-finite- 
state transducer representation of a Chinese lexicon with the Yiterbi algo¬ 
rithm. This lexicon is a slight augmentation of the FST lexicons we saw 
in Chapter 4; each word is represented as a series of arcs representing each 
character in the word, followed by a weighted arc representing the proba¬ 
bility of the word. As is commonly true with probabilistic algorithms, they 
actually use the negative log probability of the word (—log(P(w)). The log 
probability is mainly useful because the product of many probabilities gets 
very small, and so using the log probability can help avoid underflow. Using 
log probabilities also means that we are adding costs rather than multiplying 
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probabilities , and that we are looking for the minimum cost solution rather 
than the maximum probability solution. 

Consider the example in Figure 5.21. This sample lexicon Figure 5.21(a) 
consists of only 5 potential words: 

Word Pronunciation Meaning Cost (—logp(w)) 


ri-wen 

‘Japanese’ 

10.63 

ri 

‘Japan’ 

6.51 

zhang- yu 

‘octopus’ 

13.18 

wen-zhang 

‘essay’ 

9.51 

yu 

‘fish’ 

10.28 


The system represents the input sentence as the unweighted FSA in 
Figure 5.21(b). In order to compose this input with the lexicon, it needs to 
be converted into an FST. The algorithm uses a function Id which takes an 
FSA A and returns the FST which maps all and only the strings accepted 
by A to themselves. Let D* represent the transitive closure of D, i.e. the 
automaton created by adding a loop from the end of the lexicon back to 
the beginning. The set of all possible segmentations is Id(I) o D*, i.e. the 
input transducer Id(I) composed with the transitive closure of the dictionary 
D, shown in Figure 5.21(c). Then the best segmentation is the lowest-cost 
segmentation in W(/)o/)*, shown in Figure 5.21(d). 

Finding the best path shown in Figure 5.21(d) can be done easily with 
the Yiterbi algorithm and is left as an exercise for the reader. 

This segmentation algorithm, like the spelling error correction algo¬ 
rithm we saw earlier, can also be extended to incorporate the cross-word 
probabilities (Y-gram probabilities) that will be introduced in Chapter 6. 


5.10 Pronunciation in Humans 

Section 5.7 discussed many factors which influence pronunciation variation 
in humans. In this section we very briefly summarize a computational model 
of the retrieval of words from the mental lexicon as part of human lexical 
production. The model is due to Gary Dell and his colleagues; for brevity 
we combine and simplify features of multiple models (Dell, 1986, 1988; 
Dell et al , 1997) in this single overview. First consider some data. As 
we suggested in Chapter 3, production errors such as slips of the tongue 
(darn bore instead barn door) often provide important insights into lexical 
production. Dell (1986) summarizes a number of previous results about such 
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(a) Dictionary D 


: wen/0.000 



(b) Input I 

n wen zhang yu 

-©—•©—•©—•©—-O 


(c) ld(D) o D* 


e* g/6 51 : weil /0-000 : zhang/0.000 e: e/9.51 : yu/0.000 



©-©-<f> 


e: e/10.28 


©-© 

e: e/ 10.63 : zhang/0.000 : yu/0.000 



(d) BestPath(ld(D) o D*) 


: ri/0.000 : wen/0.000 E : e/10.63 : zhang/0.000 : yu/0.000 e: e/13.18 

■©-©-©—-*© - © - © - ■© 


Figure 5.21 The Sproat et al (1996) algorithm applied to four input words 
(after Sproat et al (1996)) 


slips. The lexical bias effect is that slips are more likely to create words than 
non-words; thus slips like dean bad-^t bean dad are three times more likely 
than slips like deal back-^r bead dack . The repeated-phoneme bias is that 
two phones in two words are likely to participate in an error if there is an 



182 


Chapter 5. Probabilistic Models of Pronunciation and Spelling 


identical phone in both words. Thus deal beack is more likely to slip to beal 
than deal back is. 

The model which Dell (1986, 1988) proposes is a network with 3 lev¬ 
els: semantics, word (lemma), and phonemes. 6 The semantics level has 
nodes for concepts, the lemma level has one node for each words, and the 
phoneme level has separate nodes for each phone, separated into onsets, 
vowels, and codas. Each lemma node is connected to the phoneme units 
which comprise the word, and the semantic units which represent the con¬ 
cept. Connections are used to pass activation from node to node, and are 
bidirectional and excitatory. Lexical production happens in two stages. In 
the first stage, activation passes from the semantic concepts to words. Ac¬ 
tivation will cascade down into the phonogical units and then back up into 
other word units. At some point the most highely activated word is selected. 
In the second stage, this selected is given a large jolt of activation. Again 
this activation passes to the phonological level. Now the most highly active 
phoneme nodes are selected and accessed in order. 

Figure 5.22 shows Dell’s model. Errors occur because too much acti¬ 
vation reaches the wrong phonological node. Lexical bias, for example, is 
modeled by activation spreading up from the phones of the intended word to 
neighboring words, which then activated their own phones. Thus incorrect 
phones get ‘extra’ activation if they are present in actual words. 

The two-step network model also explains other facts about lexical 
aphasic production. Aphasic speakers have various troubles in language production 
and comprehension, often caused by strokes or accidents. Dell et al (1997) 
show that weakening various connections in a network model like the one 
above can also account for the speech errors in aphasics. This supports the 
continuity hypothesis , which suggests that some part of aphasia is merely an 
extension of normal difficulties in word retrieval, and also provides further 
evidence for the network model. Readers interested in details of the model 
should see the above references and related computational models such as 
Roelofs (1997), which extends the network model to deal with syllabifica¬ 
tion, phonetic encoding, and more complex sequential structure, and Levelt 
etal (1999). 


6 Dell (1988) also has a fourth level for syllable structure which we will ignore here. 
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Semantics 



Figure 5.22 The network model of Dell (1986, 1988), showing the mech¬ 
anism for lexical bias (modified from Dell (1988, p. 134)). The boldfaced 
nodes indicate nodes with lots of activation. The intended word dean has a 
greater chance of slipping to bean because of the existence of the bean node. 
The boldfaced lines show the connections which account for the possible slip. 


5.11 Summary 


This chapter has introduced some essential metaphors and algorithms that 
will be useful throughout speech and language processing. The main points 
are as follows: 

• We can represent many language problems as if a clean string of sym¬ 
bols had been corrupted by passing through a noisy channel and it is 
our job to recover the original symbol string. One powerful way to 
recover the original symbol string is to consider all possible original 
strings, and rank them by their conditional probability. 

• The conditional probability is usually easiest to compute using the 
Bayes Rule, which breaks down the probability into a prior and a 
likelihood. For spelling error correction or pronunciation-modeling, 
the prior is computed by taking word frequencies or word bigram fre¬ 
quencies. The likelihood is computed by training a simple probabilistic 
model (like a confusion matrix, a decision tree, or a hand-written rule) 
on a database. 
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• The task of computing the distance between two strings comes up 
in spelling error correction and other problems. The minimum edit 
distance algorithm is an application of the dynamic programming 
paradigm to solving this problem, and can be used to produce the dis¬ 
tance between two strings or an alignment of the two strings. 

• The pronunciation of words is very variable. Pronunciation variation 
is caused by two classes of factors: lexical variation and allophonic 
variation. Lexical variation includes sociolinguistic factors like di¬ 
alect and register or style. 

• The single most important factor affecting allophonic variation is the 
identity of the surrounding phones. Other important factors include 
syllable structure, stress patterns, and the identity and frequency of the 
word. 

• The decoding task is the problem of finding determining the correct 
‘underlying’ sequence of symbols that generated the ‘noisy’ sequence 
of observation symbols. 

• The forward algorithm is an efficient way of computing the likeli¬ 
hood of an observation sequence given a weighted automata. Like the 
minimum edit distance algorithm, it is a variant of dynamic program¬ 
ming. It will prove particularly in Chapter 7 when we consider Hidden 
Markov Models, since it will allow us to sum multiple paths that each 
account for the same observation sequence. 

• The Viterbi algorithm, another variant of dynamic programming, is 
an efficient way of solving the decoding problem by considering all 
possible strings and using the Bayes Rule to compute their probabilities 
of generating the observed ‘noisy’ sequence. 

• Word segmentation in languages without word-boundary markers, 
like Chinese and Japanese, is another kind of optimization task which 
can be solved by the Viterbi algorithm. 


Bibliographical and Historical Notes 

Algorithms for spelling error detection and correction have existing since 
at least Blair (1960). Most early algorithm were based on similarity keys 
like the Soundex algorithm discussed in the exercises on page 89 (Odell and 
Russell, 1922; Knuth, 1973). Damerau (1964) gave a dictionary-based al¬ 
gorithm for error detection; most error-detection algorithms since then have 
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been based on dictionaries. Damerau also gave a correction algorithm that 
worked for single errors. Most algorithms since then have relied on dynamic 
programming, beginning with Wagner and Fischer (1974) (see below). Ku- 
kich (1992) is the definitive survey article on spelling error detection and 
correction. Only much later did probabilistic algorithms come into vogue 
for non-OCR spelling-error correction (for example Kashyap and Oommen 
(1983) and Kemighan et al (1990)). 

By contrast, the field of optical character recognition developed prob¬ 
abilistic algorithms quite early; Bledsoe and Browning (1959) developed a 
probabilistic approach to OCR spelling error correction that used a large dic¬ 
tionary and computed the likelihood of each observed letter sequence given 
each word in the dictionary by multiplying the likelihoods for each letter. 
In this sense Bledsoe and Browning also prefigured the modem Bayesian 
approaches to speech recognition. (Shinghal and Toussaint, 1979) and (Hull 
and Srihari, 1982) applied bigram letter-transition probabilities and the Yiterbi 
algorithm to choose the most likely correct form for a misspelled OCR input. 

The application of dynamic programming to the problem of sequence 
comparison has what Kruskal (1983) calls a ‘remarkable history of multiple 
independent discovery and publication’. Kruskal and others give at least the 
following independently-discovered variants of the algorithm published in 
four separate fields: 


Citation 


Field 


Yiterbi (1967) 

Yintsyuk (1968) 

Needleman and Wunsch (1970) 
Sakoe and Chiba (1971) 
Sankoff (1972) 

Reichert et al (1973) 

Wagner and Fischer (1974) 


information theory 
speech processing 
molecular biology 
speech processing 
molecular biology 
molecular biology 
computer science 


To the extent that there is any standard to terminology in speech and 
language processing, it is the use of the term Yiterbi for the application of 
dynamic programming to any kind of probabilistic maximization problem. 
For non-probabilistic problems, the plain term dynamic programming is 
often used. The history of the forward algorithm, which derives from Hid¬ 
den Markov Models, will be summarized in Chapter 7. Sankoff and Kruskal 
(1983) is a collection exploring the theory and use of sequence comparison 
in different fields. Forney (1973) is an early survey paper which explores the 
origin of the Yiterbi algorithm in the context of information and communi- 
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cations theory. 

The weighted finite-state automata was first described by (Pereira et al , 
1994), drawing from a combination of work in finite-state transducers and 
work in probabilistic languages (Booth and Thompson, 1973). 
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Exercises 

5.1 Computing minimum edit distances by hand, figure out whether drive 
is closer to brief or to divers , and what the edit distance is. You may use any 
version of distance that you like. 

5.2 Now implement a minimum edit distance algorithm and use your hand- 
computed results to check your code. 

5.3 The Yiterbi algorithm can be used to extend a simplified version of 
the Kemighan et al (1990) spelling error correction algorithm. Recall that 
the Kemighan et al (1990) algorithm only allowed a single spelling error 
for each potential correction. Let’s simplify by assuming that we only have 
three confusion matrices instead of four (del, ins and sub; no trans). Now 
show how the Yiterbi algorithm can be used to extend the Kemighan et al 
(1990) algorithm to handle multiple spelling errors per word. 

5.4 To attune your ears to pronunciation reduction, listen for the pronun¬ 
ciation of the word the, a, or to in the spoken language around you. Try to 
notice when it is reduced, and mark down whatever facts about the speaker 
or speech situation that you can. What are your observations? 

5.5 Find a speaker of a different dialect of English than your own (even 
someone from a slightly different region of your native dialect) and tran¬ 
scribe (using the ARPAbet or IPA) 10 words that they pronounce differently 
than you. Can you spot any generalizations? 

5.6 Implement the Forward algorithm. 

5.7 Write a modified version of the Viterbi algorithm which solves the seg¬ 
mentation problem from Sproat et al (1996). 

5.8 Now imagine a version of English that was written without spaces. 

Apply your segmentation program to this ‘compressed English’. You will 
need other programs to compute word bigrams or trigrams. 

5.9 Two words are confusable if they have phonetically similar pronunci- confusable 
ations. Use one of your dynamic programming implementations to take two 

words and output a simple measure of how confusable they are. You will 
need to use an on-line pronunciation dictionary. You will also need a metric 
for how close together two phones are. Use your favorite set of phonetic 
feature vectors for this. You may assume some small constant probability of 
phone insertion and deletion. 





N-GRAMS 


But it must be recognized that the notion ‘probability of a sen¬ 
tence 9 is an entirely useless one, under any known interpretation 
of this term . 

Noam Chomsky (1969, p. 57) 

Anytime a linguist leaves the group the recognition rate goes up . 
Fred Jelinek (then of the IBM speech group) (1988) 1 


Imagine listening to someone as they speak and trying to guess the next 
word that they are going to say. For example what word is likely to follow 
this sentence fragment?: 

I’d like to make a collect... 

Probably the most likely word is call, although it’s possible the next 
word could be telephone , or person-to-person or international (Think of 
some others). Guessing the next word (or word prediction) is an essen- prediction 
tial subtask of speech recognition, hand-writing recognition, augmentative 
communication for the disabled, and spelling error detection. In such tasks, 
word-identification is difficult because the input is very noisy and ambigu¬ 
ous. Thus looking at previous words can give us an important cue about 
what the next ones are going to be. Russell and Norvig (1995) give an exam¬ 
ple from Take the Money and Run , in which a bank teller interprets Woody 
Allen’s sloppily written hold-up note as saying “I have a gub”. A speech 

1 In an address to the first Workshop on the Evaluation of Natural Language Processing 
Systems, December 7, 1988. While this workshop is described in Palmer and Finin (1990), 
the quote was not written down; some participants remember a more snappy version: Every 
time I fire a linguist the performance of the recognizer improves. 
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recognition system (and a person) can avoid this problem by their knowl¬ 
edge of word sequences (“a gub” isn’t an English word sequence) and of 
their probabilities (especially in the context of a hold-up, “I have a gun” will 
have a much higher probability than “I have a gub” or even “I have a gull”). 

This ability to predict the next word is important for augmentative 
communication systems (Newell et al , 1998). These are computer sys¬ 
tems that help the disabled in communication. For example, people who 
are unable to use speech or sign-language to communicate, like the physi¬ 
cist Steven Hawkings, use systems that speak for them, letting them choose 
words with simple hand movements, either by spelling them out, or by se¬ 
lecting from a menu of possible words. But spelling is very slow, and a menu 
of words obviously can’t have all possible English words on one screen. 
Thus it is important to be able to know which words the speaker is likely to 
want to use next, so as to put those on the menu. 

Finally, consider the problem of detecting real-word spelling errors. 
These are spelling errors that result in real words of English (although not 
the ones the writer intended) and so detecting them is difficult (we can’t find 
them by just looking for words that aren’t in the dictionary). Figure 6.1 gives 
some examples. 

They are leaving in about fifteen minuets to go to her house. 

The study was conducted mainly be John Black. 

The design an construction of the system will take more than a year. 

Hopefully, all with continue smoothly in my absence. 

Can they lave him my messages? 

I need to notified the bank of [this problem.] 

He is trying to fine out. _ 

Figure 6.1 Some attested real-word spelling errors from Kukich (1992). 


These errors can be detected by algorithms which examine, among 
other features, the words surrounding the errors. For example, while the 
phrase in about fifteen minuets is perfectly grammatical English, it is a very 
unlikely combination of words. Spellcheckers can look for low probability 
combinations like this. In the examples above the probability of three word 
combinations ( they lave him , to fine out , to notified the ) is very low. Of 
course sentences with no spelling errors may also have low probability word 
sequences, which makes the task challenging. We will see in Section 6.6 that 
there are a number of different machine learning algorithms which make use 
of the surrounding words and other features to do context-sensitive spelling 
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error correction. 

Guessing the next word turns out to be closely related to another prob¬ 
lem: computing the probability of a sequence of words. For example the 
following sequence of words has a non-zero probability of being encoun¬ 
tered in a text written in English: 

... all of a sudden I notice three guys standing on the sidewalk 

taking a very good long gander at me. 

while this same set of words in a different order probably has a very low 
probability: 

good all I of notice a taking sidewalk the me long three at sudden 

guys gander on standing a a the very 

Algorithms that assign a probability to a sentence can also be used to 
assign a probability to the next word in an incomplete sentence, and vice 
versa. We will see in later chapters that knowing the probability of whole 
sentences or strings of words is useful in part-of-speech-tagging (Chapter 8), 
word-sense disambiguation, and probabilistic parsing Chapter 12. 

In speech recognition, it is traditional to use the term language model 
or LM for a statistical model of word sequences. In the rest of this chapter 
we will be using both language model and grammar, depending on the 
context. 

6.1 Counting Words in Corpora 


[upon being asked if there weren’t enough words in the English language for him]: 
“ Yes, there are enough, but they aren't the right ones." 

James Joyce, reported in Bates (1997) 


Probabilities are based on counting things. Before we talk about prob¬ 
abilities, we need to decide what we are going to count and where we are 
going to find the things to count. 

As we saw in Chapter 5, statistical processing of natural language is 
based on corpora (singular corpus), on-line collections of text and speech. 
For computing word probabilities, we will be counting words in a training 
corpus. Let’s look at part of the Brown Corpus, a 1 million word collection 
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of samples from 500 written texts from different genres (newspaper, nov¬ 
els, non-fiction, academic, etc.), which was assembled at Brown University 
in 1963-64 (Kucera and Francis, 1967; Francis, 1979; Francis and Kucera, 
1982). It contains sentence (6.1); how many words are in this sentence? 

(6.1) He stepped out into the hall, was delighted to encounter a water 

brother. 

Example 6.1 has 13 words if we don’t count punctuation-marks as 
words, 15 if we count punctuation. Whether we treat period (V), comma 
(V), and so on as words depends on the task. There are tasks such as 
grammar-checking, spelling error detection, or author-identification, for which 
the location of the punctuation is important (for checking for proper capital¬ 
ization at the beginning of sentences, or looking for interesting patterns of 
punctuation usage that uniquely identify an author). In natural language pro¬ 
cessing applications, question-marks are an important cue that someone has 
asked a question. Punctuation is a useful cue for part-of-speech tagging. 
These applications, then, often count punctuation as words. 

Unlike text corpora, corpora of spoken language usually don’t have 
punctuation, but speech corpora do have other phenomena that we might or 
might not want to treat as words. One speech corpus, the Switchboard corpus 
of telephone conversations between strangers, was collected in the early 
1990’s and contains 2430 conversations averaging 6 minutes each, for a total 
of 240 hours of speech and 3 million words (Godfrey et at, 1992). Here’s 
a sample utterance of Switchboard (since the units of spoken language are 
utterance different than written language, we will use the word utterance rather than 
‘sentence’ when we are referring to spoken language): 

(6.2) I do uh main- mainly business data processing 

This utterance, like many or most utterances in spoken language, has 
fragments fragments, words that are broken off in the middle, like the first instance 

pauses of the word mainly , represented here as main-. It also has filled pauses like 

uh , which doesn’t occur in written English. Should we consider these to be 
words? Again, it depends on the application. If we are building an automatic 
dictation system based on automatic speech recognition, we might want to 
strip out the fragments. But the uhs and urns are in fact much more like 
words. For example, Smith and Clark (1993) and Clark (1994) have shown 
that um has a slightly different meaning than uh (generally speaking urn is 
used when speakers are having major planning problems in producing an 
utterance, while uh is used when they know what they want to say, but are 
searching for the exact words to express it). Stolcke and Shriberg (1996b) 
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also found that uh can be a useful cue in predicting the next word (why might 
this be?), and so most speech recognition systems treat uh as a word. 

Are capitalized tokens like They and uncapitalized tokens like they the 
same word? For most statistical applications these are lumped together, 
although sometimes (for example for spelling error correction or part-of- 
speech-tagging) the capitalization is retained as a separate feature. For the 
rest of this chapter we will assume our models are not case-sensitive. 

How should we deal with inflected forms like cats versus caft Again, 
this depends on the application. Most current A-gram based systems are 
based on the wordform, which is the inflected form as it appears in the 
corpus. Thus these are treated as two separate words. This is not a good 
simplification for many domains, which might want to treat cats and cat as 
instances of a single abstract word, or lemma. A lemma is a set of lexical 
forms having the same stem, the same major part of speech, and the same 
word-sense. We will return to the distinction between wordforms (which 
distinguish cat and cats) and lemmas (which lump cat and cats together) in 
Chapter 16. 

How many words are there in English? One way to answer this ques¬ 
tion is to count in a corpus. We use types to mean the number of distinct 
words in a corpus, i.e. the size of the vocabulary, and tokens to mean the 
total number of running words. Thus the following sentence from the Brown 
corpus has 16 word tokens and 14 word types (not counting punctuation): 

(6.3) They picnicked by the pool, then lay back on the grass and looked at 

the stars. 

The Switchboard corpus has 2.4 million wordform tokens and ap¬ 
proximately 20,000 wordform types. This includes proper nouns. Spoken 
language is less rich in its vocabulary than written language: Kucera (1992) 
gives a count for Shakespeare’s complete works at 884,647 wordform tokens 
from 29,066 wordform types. Thus each of the 884,647 wordform tokens is 
a repetition of one of the 29,066 wordform types. The 1 million wordform 
tokens of the Brown corpus contain 61,805 wordform types that belong to 
37,851 lemma types. All these corpora are quite small. Brown et al (1992) 
amassed a corpus of 583 million wordform tokens of English that included 
293,181 different wordform types. 

Dictionaries are another way to get an estimate of the number of words, 
although since dictionaries generally do not include inflected forms they are 
better at measuring lemmas than wordforms. The American Heritage 3rd 
edition dictionary has 200,000 “boldface forms”; this is somewhat higher 
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than the true number of lemmas, since there can be one or more boldface 
form per lemma (and since the boldface forms includes multiword phrases). 

The rest of this chapter will continue to distinguish between types and 
tokens. ‘Types’ will mean wordform types and not lemma types, and punc¬ 
tuation marks will generally be counted as words. 


6.2 Simple (Unsmoothed) N-grams 

The models of word sequences we will consider in this chapter are proba¬ 
bilistic models; ways to assign probabilities to strings of words, whether for 
computing the probability of an entire sentence or for giving a probabilistic 
prediction of what the next word will be in a sequence. As we did in Chap¬ 
ter 5, we will assume that the reader has a basic knowledge of probability 
theory. 

The simplest possible model of word sequences would simply let any 
word of the language follow any other word. In the probabilistic version of 
this theory, then, every word would have an equal probability of following 
every other word. If English had 100,000 words, the probability of any word 
following any other word would be yqoToo or *00001. 

In a slightly more complex model of word sequences, any word could 
follow any other word, but the following word would appear with its nor¬ 
mal frequency of occurrence. For example, the word the has a high relative 
frequency, it occurs 69,971 times in the Brown corpus of 1,000,000 words 
(i.e. 1% of the words in this particular corpus are the). By contrast the word 
rabbit occurs only 11 times in the Brown corpus. 

We can use these relative frequencies to assign a probability distribu¬ 
tion across following words. So if we’ve just seen the string Anyhow ; we can 
use the probability .07 for the and .00001 for rabbit to guess the next word. 
But suppose we’ve just seen the following string: 

Just then, the white 

In this context rabbit seems like a more reasonable word to follow 
white than the does. This suggests that instead of just looking at the in¬ 
dividual relative frequencies of words, we should look at the conditional 
probability of a word given the previous words. That is, the probability 
of seeing rabbit given that we just saw white (which we will represent as 
P(rabbit\white )) is higher than the probability of rabbit otherwise. 

Given this intuition, let’s look at how to compute the probability of a 
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complete string of words (which we can represent either as w \... w n or w”). 
If we consider each word occurring in its correct location as an independent 
event, we might represent this probability as follows: 

P{w h w 2 ...,w n - U w n ) (6.4) 

We can use the chain rule of probability to decompose this probability: 

P(iv”) = P{w\)P(w 2 \w\)P{wi\w\)...P(w n \w\- 1 ) 

= f\P(mH-') (6.5) 

k = 1 

But how can we compute probabilities like P{w n \w r {~ l )l We don’t 
know any easy way to compute the probability of a word given a long se¬ 
quence of preceding words. (For example, we can’t just count the number of 
times every word occurs following every long string; we would need far too 
large a corpus). 

We solve this problem by making a useful simplification: we approxi¬ 
mate the probability of a word given all the previous words. The approxima¬ 
tion we will use is very simple: the probability of the word given the single 
previous word! The bigram model approximates the probability of a word 
given all the previous words P(w M |w” -1 ) by the conditional probability of 
the preceding word P{w n \w n -\). In other words, instead of computing the 
probability 

P(rabbit| Just the other I day I saw a) (6.6) 

we approximate it with the probability 

P(rabbit|a) (6.7) 

This assumption that the probability of a word depends only on the 
previous word is called a Markov assumption. Markov models are the class 
of probabilistic models that assume that we can predict the probability of 
some future model without looking too far into the past. We saw this use of 
the word Markov in introducing the Markov chain in Chapter 5. Recall that 
a Markov chain is a kind of weighted finite-state automaton; the intuition of 
the term Markov in Markov chain is that the next state of a weighted FSA is 
always dependent on a finite history (since the number of states in a finite- 
state automaton is finite). The simple bigram model can be viewed as a 
simple kind of Markov chain which has one state for each word. 

We can generalize the bigram (which looks one word into the past) to 
the trigram (which looks two words into the past) and thus to the N-gram 
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(which looks A — 1 words into the past). A bigram is called a first-order first-order 
Markov model (because it looks one token into the past), a trigram is a 
second-order Markov model, and in general an A-gram is a A — 1th or¬ 
der Markov model. Markov models of words were common in engineering, 
psychology, and linguistics until Chomsky’s influential review of Skinner’s 
Verbal Behavior in 1958 (see the History section at the back of the chapter), 
but went out of vogue until the success of A-gram models in the IBM speech 
recognition laboratory at the Thomas J. Watson Research Center, brought 
them back to the attention of the community. 

The general equation for this A-gram approximation to the conditional 
probability of the next word in a sequence is: 

%W‘) «%K:i+i) (6.8) 

Equation 6.8 shows that the probability of a word w n given all the pre¬ 
vious words can be approximated by the probability given only the previous 
A words. 

For a bigram grammar, then, we compute the probability of a complete 
string by substituting equation 6.8 into equation 6.5. The result: 

p(w1) « f\P{w k \w k ^) (6.9) 

k = 1 

Let’s look at an example from a speech-understanding system. The 
Berkeley Restaurant Project is a speech-based restaurant consultant; users 
ask questions about restaurants in Berkeley, California, and the system dis¬ 
plays appropriate information from a database of local restaurants (Jurafsky 
et al , 1994). Here are some sample user queries: 

I’m looking for Cantonese food. 

I’d like to eat dinner someplace nearby. 

Tell me about Chez Panisse. 

Can you give me a listing of the kinds of food that are available? 

I’m looking for a good place to eat breakfast. 

I definitely do not want to have cheap Chinese food. 

When is Caffe Venezia open during the day? 

I don’t wanna walk more than ten minutes. 

Table 6.2 shows a sample of the bigram probabilities for some of the 
words that can follow the word eat , taken from actual sentences spoken by 
users (putting off just for now the algorithm for training bigram probabil¬ 
ities). Note that these probabilities encode some facts that we think of as 
strictly syntactic in nature (like the fact that what comes after eat is usually 
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something that begins a noun phrase, i.e. an adjective, quantifier or noun), 
as well as facts that we think of as more culturally based (like the low prob¬ 
ability of anyone asking for advice on finding British food). 


eat on 

.16 

eat Thai 

.03 

eat some 

.06 

eat breakfast 

.03 

eat lunch 

.06 

eat in 

.02 

eat dinner 

.05 

eat Chinese 

.02 

eat at 

.04 

eat Mexican 

.02 

eat a 

.04 

eat tomorrow 

.01 

eat Indian 

.04 

eat dessert 

.007 

eat today 

.03 

eat British 

.001 

Figure 6.2 A fragment of a bigram grammar from the Berkeley Restaurant 
Project showing the most likely words to follow eat. 


Assume that in addition to the probabilities in Table 6.2, our grammar 
also includes the bigram probabilities in Table 6.3 (<s> is a special word 
meaning ‘Start of sentence’). 


<S> I 

.25 

I want 

.321 

want to 

.65 

to eat 

.26 

British food 

.60 

<s> I’d 

.06 

I would .29 

want a 

.05 

to have 

.14 

British restaurant 

.15 

<s> Tell .04 

I don’t 

.08 

want some .04 

to spend .09 

British cuisine 

.01 

<s> I’m 

.02 

I have 

.04 

want thai 

.01 

to be 

.02 

British lunch 

.01 


Figure 6.3 More fragments from the bigram grammar from the Berkeley 
Restaurant Project. 


Now we can compute the probability of sentences like I want to eat 
British food ox I want to eat Chinese food by simply multiplying the appro¬ 
priate bigram probabilities together, as follows: 

P(I want to eat British food) = P(I|<s>)P(want|I)P(to|want)P(eat|to) 

P(British|eat)P(food|British) 

= .25 * .32 * .65 * .26 * .002 * .60 
= .000016 

As we can see, since probabilities are all less than 1 (by definition), the 
product of many probabilities gets smaller the more probabilities we multi¬ 
ply. This causes a practical problem: the risk of numerical underflow. If we 
are computing the probability of a very long string (like a paragraph or an 
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entire document) it is more customary to do the computation in log space; we 
take the log of each probability (the logprob), add all the logs (since adding 
in log space is equivalent to multiplying in linear space) and then take the 
anti-log of the result. For this reason many standard programs for computing 
A-grams actually store and calculate all probabilities as logprobs. In this text 
we will always report logs in base 2 (i.e. we will use log to mean log 2 ). 

A trigram model looks just the same as a bigram model, except that 
we condition on the two previous words (e.g. we use P(food\eat British) 
instead of /^(foodl British)). To compute trigram probabilities at the very 
beginning of sentence, we can use two pseudo-words for the first trigram 
(i.e. P(/| < startl >< start2 >)). 

A-gram models can be trained by counting and normalizing (for prob¬ 
abilistic models, normalizing means dividing by some total count so that the 
resulting probabilities fall legally between 0 and 1). We take some training 
corpus, and from this corpus take the count of a particular bigram, and divide 
this count by the sum of all the bigrams that share the same first word: 


P(w n \w n -l) = 


C(w n -\W n ) 

Y, w C{w n -\w) 


( 6 . 10 ) 


We can simplify this equation, since the sum of all bigram counts that 
start with a given word w n -\ must be equal to the unigram count for that 
word w n - 1 . (The reader should take a moment to be convinced of this): 


P(w„\w n -i) = 


C(w„-iw„) 


C(w n - 1 ) 

For the general case of A-gram parameter estimation: 


P{Wn\K-N+l) = 


CK_; + ivy») 

CW-AT+l) 


( 6 . 11 ) 


( 6 . 12 ) 


Equation 6.12 estimates the A-gram probability by dividing the ob¬ 
served frequency of a particular sequence by the observed frequency of a 
prefix. This ratio is called a relative frequency; the use of relative fre¬ 
quencies as a way to estimate probabilities is one example of the technique 
known as Maximum Likelihood Estimation or MLE, because the resulting 
parameter set is one in which the likelihood of the training set T given the 
model M (i.e. P(T\M)) is maximized. For example, suppose the word Chi¬ 
nese occurs 400 times in a corpus of a million words like the Brown corpus. 
What is the probability that it will occur in some other text of way a million 
words? The MLE estimate of its probability is 10 qq° qq or .0004. Now .0004 
is not the best possible estimate of the probability of Chinese occurring in all 
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situations; but it is the probability that makes it most likely that Chinese will 
occur 400 times in a million-word corpus. 

There are better methods of estimating /V-gram probabilities than using 
relative frequencies (we will consider a class of important algorithms in Sec¬ 
tion 6.3), but even the more sophisticated algorithms make use in some way 
of this idea of relative frequency. Figure 6.4 shows the bigram counts from a 
piece of a bigram grammar from the Berkeley Restaurant Project. Note that 
the majority of the values are zero. In fact we have chosen the sample words 
to cohere with each other; a matrix selected from a random set of 7 words 
would be even more sparse. 
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Figure 6.4 Bigram counts for 7 of the words (out of 1616 total word types) 

in the Berkeley Restaurant Project corpus of' 

TO,000 sentences. 



Figure 6.5 shows the bigram probabilities after normalization (dividing 
each row by the following appropriate unigram counts: 
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3437 

want 
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to 

3256 
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938 
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459 


More on N-grams and their sensitivity to the training corpus 

In this section we look at a few examples of different TV-gram models to 
get an intuition for two important facts about their behavior. The first is the 
increasing accuracy of /V-gram models as we increase the value of N. The 
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Figure 6.5 Bigram probabilities for 7 of the words (out of 1616 total word 

types) in the Berkeley Restaurant Project corpus of "10,000 sentences. 



second is their very strong dependency on their training corpus (in particular 
its genre and its size in words). 

We do this by borrowing a visualization technique proposed by Shan¬ 
non (1951) and also used by Miller and Selfridge (1950). The idea is to train 
various A-grams and then use each to generate random sentences. It’s sim¬ 
plest to visualize how this works for the unigram case. Imagine all the words 
of English covering the probability space between 0 and 1. We choose a ran¬ 
dom number between 0 and 1, and print out the word that covers the real 
value we have chosen. The same technique can be used to generate higher 
order A-grams by first generating a random bigram that starts with <s> (ac¬ 
cording to its bigram probability), then choosing a random bigram to follow 
it (again, where the likelihood of following a particular bigram is propor¬ 
tional to its conditional probability), and so on. 

To give an intuition for the increasing power of higher-order A-grams, 
we trained a unigram, bigram, trigram, and a quadrigram model on the com¬ 
plete corpus of Shakespeare’s works. We then used these four grammars to 
generate random sentences. In the following examples we treated each punc¬ 
tuation mark as if it were a word in its own right, and we trained the gram¬ 
mars on a version of the corpus with all capital letters changed to lowercase. 
After generated the sentences we corrected the output for capitalization just 
to improve readability. Some of the resulting sentences: 

1. Unigram approximation to Shakespeare 

(a) To him swallowed confess hear both. Which. Of save on trail for are ay 
device and rote life have 

(b) Every enter now severally so, let 

(c) Hill he late speaks; or! a more to leg less first you enter 
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(d) Will rash been and by I the me loves gentle me not slavish page, the and 
hour; ill let 

(e) Are where exeunt and sighs have rise excellency took of.. Sleep knave 
we. near; vile like 

2. Bigram approximation to Shakespeare 

(a) What means, sir. I confess she? then all sorts, he is trim, captain. 

(b) Why dost stand forth thy canopy, forsooth; he is this palpable hit the 
King Henry. Live king. Follow. 

(c) What we, hath got so she that I rest and sent to scold and nature bankrupt, 
nor the first gentleman? 

(d) Enter Menenius, if it so many good direction found’sl thou art a strong 
upon command of fear not a liberal largess given away, Falstaff! Exeunt 

(e) Thou whoreson chops. Consumption catch your dearest friend, well, 
and I know where many mouths upon my undoing all but be, how soon, 
then; we’ll execute upon my love’s bonds and we do you will? 

(f) The world shall- my lord! 

3. Trigram approximation to Shakespeare 

(a) Sweet prince, Falstaff shall die. Harry of Monmouth’s grave. 

(b) This shall forbid it should be branded, if renown made it empty. 

(c) What is’t that cried? 

(d) Indeed the duke; and had a very good friend. 

(e) Fly, and will rid me these news of price. Therefore the sadness of part¬ 
ing, as they say, ’tis done. 

(f) The sweet! How many then shall posthumus end his miseries. 

4. Quadrigram approximation to Shakespeare 

(a) King Henry. What! I will go seek the traitor Gloucester. Exeunt some 
of the watch. A great banquet serv’d in; 

(b) Will you not tell me who I am? 

(c) It cannot be but so. 

(d) Indeed the short and the long. Marry, ’tis a noble Lepidus. 

(e) They say all lovers swear more performance than they are wont to keep 
obliged faith unforfeited! 

(f) Enter Leonato’s brother Antonio, and the rest, but seek the weary beds 
of people sick. 

The longer the context on which we train the model, the more coher¬ 
ent the sentences. In the unigram sentences, there is no coherent relation 
between words, and in fact none of the sentences end in a period or other 
sentence-final punctuation. The bigram sentences can be seen to have very 
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Methodology Box: Training Sets and Test Sets 

The probabilities in a statistical model like an iV-gram come from 
the corpus it is trained on* This training corpus needs to be care¬ 
fully designed. If the training corpus is too specific to the task or 
domain, the probabilities may be too narrow and not generalize well 
to new sentences. If the training corpus is too general, the probabil¬ 
ities may not do a sufficient job of reflecting the task or domain. 
Furthermore, suppose we are trying to compute the probabil¬ 
ity of a particular 'test 1 sentence. If our ‘test’ sentence is part of 
the training corpus, it will have an artificially high probability. The 
training corpus must not be biased by including this sentence. Thus 
when using a statistical model of language given some corpus of rel¬ 
evant data, we start by dividing the data into a training set and a test 
set, Wc train the statistical parameters of the model on the training 
set, and then use them to compute probabilities on the test set. 

This training-and-testing paradigm can also be used to evaluate 
different ALgram architectures* For example to compare the different 
smoothing algorithms we will introduce in Section 6.3, we can take 
a large corpus and divide it into a training set and a test set. Then 
we train the two different N-gram models on the training set and 
see which one better models the test set. But what does it mean to 
‘model the test set'? There is a useful metric for how well a given 
statistical model matches a test corpus, called perplexity. Perplexity 
is a variant of entropy, and will be introduced on page 221. 

In some cases wc need more than one test set* For example, 
suppose we have a few different possible language models and we 
want first to pick the best one and then to sec how it docs on a fair 
test set, Le* one we've never looked at before. We first use a devel¬ 
opment test set (also called a devtest set) to pick the best language 
model, and perhaps tune some parameters. Then once we come up 
with what we think is the best model, we run it on the true test set, 
When comparing models it is important to use statistical tests 
(introduced in any statistics class or textbook for the social sciences) 
to determine if the difference between Iwo models is significant, Co¬ 
hen (19 95) is a useful reference which focuses on statistical research 
methods for artificial intelligence. Dietterich (1998) focuses on sta¬ 
tistical tests for comparing classifiers. 
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local word-to-word coherence (especially if we consider that punctuation 
counts as a word). The trigram and quadrigram sentences are beginning to 
look a lot like Shakespeare. Indeed a careful investigation of the quadri¬ 
gram sentences shows that they look a little too much like Shakespeare. The 
words It cannot be but so are directly from King John. This is because 
the Shakespeare oeuvre, while large by many standards, is somewhat less 
than a million words. Recall that Kucera (1992) gives a count for Shake¬ 
speare’s complete works at 884,647 words (tokens) from 29,066 wordform 
types (including proper nouns). That means that even the bigram model is 
very sparse; with 29,066 types, there are 29,066 1 2 3 , or more than 844 million 
possible bigrams, so a 1 million word training set is clearly vastly insufficient 
to estimate the frequency of the rarer ones; indeed somewhat under 300,000 
different bigram types actually occur in Shakespeare. This is far too small to 
train quadrigrams; thus once the generator has chosen the first quadrigram 
(It cannot be but), there are only 5 possible continuations (that, I, he, thou, 
and so); indeed for many quadrigrams there is only one continuation. 

To get an idea of the dependence of a grammar on its training set, 
let’s look at an A-gram grammar trained on a completely different corpus: 
the Wall Street Journal (WSJ). A native speaker of English is capable of 
reading both Shakespeare and the Wall Street Journal; both are subsets of 
English. Thus it seems intuitive that our A-grams for Shakespeare should 
have some overlap with A-grams from the Wall Street Journal. In order to 
check whether this is true, here are three sentences generated by unigram, 
bigram, and trigram grammars trained on 40 million words of articles from 
the daily Wall Street Journal (these grammars are Katz backoff grammars 
with Good-Turing smoothing; we will learn in the next section how these are 
constructed). Again, we have corrected the output by hand with the proper 
English capitalization for readability. 

1. (unigram) Months the my and issue of year foreign new exchange’s 
September were recession exchange new endorsed a acquire to six ex¬ 
ecutives 

2. (bigram) Last December through the way to preserve the Hudson cor¬ 
poration N. B. E. C. Taylor would seem to complete the major central 
planners one point five percent of U. S. E. has already old M. X. corpo¬ 
ration of living on information such as more frequently fishing to keep 
her 

3. (trigram) They also point to ninety nine point six billion dollars from 
two hundred four oh six three percent of the rates of interest stores as 
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Mexico and Brazil on market conditions 

Compare these examples to the pseudo-Shakespeare on the previous 
page; while superficially they both seem to model ‘English-like sentences’ 
there is obviously no overlap whatsoever in possible sentences, and very lit¬ 
tle if any overlap even in small phrases. The difference between the Shake¬ 
speare and WSJ corpora tell us that a good statistical approximation to En¬ 
glish will have to involve a very large corpus with a very large cross-section 
of different genres. Even then a simple statistical model like an A-gram 
would be incapable of modeling the consistency of style across genres (we 
would only want to expect Shakespearean sentences when we are reading 
Shakespeare, not in the middle of a Wall Street Journal article). 

6.3 Smoothing 


Never do I ever want to hear another word! 

There isn't one, 

I haven't heard! 

Eliza Doolittle in Alan Jay Lemer’s My Fair Lady lyrics 


words people 
never use — 
could be 
only I 
know them 

Ishikawa Takuboku 1885-1912 

One major problem with standard A-gram models is that they must 
be trained from some corpus, and because any particular training corpus is 
finite, some perfectly acceptable English A-grams are bound to be missing 
sparse from it. That is, the bigram matrix for any given training corpus is sparse; 

it is bound to have a very large number of cases of putative ‘zero probability 
bigrams’ that should really have some non-zero probability. Furthermore, 
the MLE method also produces poor estimates when the counts are non-zero 
but still small. 

Some part of this problem is endemic to A-grams; since they can’t 
use long-distance context, they always tend to underestimate the probability 
of strings that happen not to have occurred nearby in their training corpus. 
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But there are some techniques we can use to assign a non-zero probability 
to these ‘zero probability bigrams’. This task of reevaluating some of the 
zero-probability and low-probability A-grams, and assigning them non-zero 
values, is called smoothing. In the next few sections we will introduce some 
smoothing algorithms and show how they modify the Berkeley Restaurant 
bigram probabilities in Figure 6.5. 


Add-One Smoothing 

One simple way to do smoothing might be just to take our matrix of bigram 
counts, before we normalize them into probabilities, and add one to all the 
counts. This algorithm is called add-one smoothing. Although this algo¬ 
rithm does not perform well and is not commonly used, it introduces many 
of the concepts that we will see in other smoothing algorithms, and also gives 
us a useful baseline. 

Let’s first consider the application of add-one smoothing to unigram 
probabilities, since that will be simpler. The unsmoothed maximum likeli¬ 
hood estimate of the unigram probability can be computed by dividing the 
count of the word by the total number of word tokens N: 


P(w x ) 


c{w x ) 

IaC(Wt) 

c{w x ) 


N 


The various smoothing estimates will rely on an adjusted count c*. The 
count adjustment for add-one smoothing can then be defined by adding one 
to the count and then multiplying by a normalization factor, where V 
is the total number of word types in the language, i.e. the vocabulary size. 
Since we are adding 1 to the count for each word type, the total number of 
tokens must be increased by the number of types. The adjusted count for 
add-one smoothing is then defined as: 


‘* = <« + 1 >jirTv <6 ' 13) 

and the counts can be turned into probabilities p* by normalizing by N. 

An alternative way to view a smoothing algorithm is as discounting 
(lowering) some non-zero counts in order to get the probability mass that 
will be assigned to the zero counts. Thus instead of referring to the dis¬ 
counted counts c*, many papers also define smoothing algorithms in terms 
of a discount d c , the ratio of the discounted counts to the original counts: 


SMOOTHING 


ADD-ONE 


VOCABULARY 

SIZE 


DISCOUNTING 


DISCOUNT 
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c 

Alternatively we can compute the probability p* directly from the counts 
as follows: 


* _ Ci + l 

Pi ~ N + V 

Now that we have the intuition for the unigram case, let’s smooth 
our Berkeley Restaurant Project bigram. Figure 6.6 shows the add-one- 
smoothed counts for the bigram in Figure 6.4. 
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Figure 6.6 Add-one Smoothed Bigram counts for 7 of the words (out of 

1616 total word types) in 

the Berkeley Restaurant Project corpus of 

~10,000 

sentences. 








Figure 6.7 shows the add-one-smoothed probabilities for the bigram in 
Figure 6.5. Recall that normal bigram probabilities are computed by nor¬ 
malizing each row of counts by the unigram count: 


P{w n \w n -\) = 


C{w n -\w n ) 
C{w n - 1 ) 


(6.14) 


For add-one-smoothed bigram counts we need to first augment the un¬ 
igram count by the number of total word types in the vocabulary V: 


p*{w n \w n -i) = 


C(w„-iw„) +1 
C(w„_i) + V 


(6.15) 


We need to add V (= 1616) to each of the unigram counts: 
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I 3437+1616 = 5053 

want 1215+1616 = 2931 

to 3256+1616 = 4872 

eat 938+1616 = 2554 

Chinese 213+1616 = 1829 

food 1506+1616 = 3122 

lunch 459+1616 = 2075 

The result is the smoothed bigram probabilities in Figure 6.7. 
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Figure 6.7 Add-one smoothed bigram probabilities for 7 of the words (out 
of 1616 total word types) in the Berkeley Restaurant Project corpus of ~10,000 
sentences. 


It is often convenient to reconstruct the count matrix so we can see 
how much a smoothing algorithm has changed the original counts. These 
adjusted counts can be computed by Equation 6.13. Figure 6.8 shows the 
reconstructed counts. 
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Figure 6.8 Add-one smoothed bigram counts for 7 of the words (out of 
1616 total word types) in the Berkeley Restaurant Project Corpus of ~10,000 
sentences. 
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Note that add-one smoothing has made a very big change to the counts. 
C{want to) changed from 786 to 331! We can see this in probability space 
as well: P(to\want) decreases from .65 in the unsmoothed case to .28 in the 
smoothed case. 

Looking at the discount d (the ratio between new and old counts) shows 
us how strikingly the counts for each prefix-word have been reduced; the 
bigrams starting with Chinese were discounted by a factor of 8! 
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The sharp change in counts and probabilities occurs because too much 
probability mass is moved to all the zeros. The problem is that we arbitrarily 
picked the value “1” to add to each count. We could avoid this problem by 
adding smaller values to the counts (‘add-one-half’ ‘add-one-thousandth’), 
but we would need to retrain this parameter for each situation. 

In general add-one smoothing is a poor method of smoothing. Gale and 
Church (1994) summarize a number of additional problems with the add-one 
method; the main problem is that add-one is much worse at predicting the 
actual probability for bigrams with zero counts than other methods like the 
Good-Turing method we will describe below. Furthermore, they show that 
variances of the counts produced by the add-one method are actually worse 
than those from the unsmoothed MLE method. 

Witten-Bell Discounting 

A much better smoothing algorithm that is only slightly more complex than 
di'scounIng Add-One smoothing we will refer to as Witten-Bell discounting (it is in¬ 
troduced as Method C in Witten and Bell (1991)). Witten-Bell discounting 
is based on a simple but clever intuition about zero-frequency events. Let’s 
think of a zero-frequency word or A-gram as one that just hasn’t happened 
yet. When it does happen, it will be the first time we see this new A-gram. 
So the probability of seeing a zero-frequency A-gram can be modeled by the 
probability of seeing an A-gram for the first time. This is a recurring concept 
in statistical language processing: 
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Key Concept #4. Things Seen Once: Use the count of things you’ve 
seen once to help estimate the count of things you’ve never seen. 


The idea that we can estimate the probability of ‘things we never saw’ 
with help from the count of ‘things we saw once’ will return when we dis¬ 
cuss Good-Turing smoothing later in this chapter, and then once again when 
we discuss methods for tagging an unknown word with a part-of-speech in 
Chapter 8. 

How can we compute the probability of seeing an Af-gram for the first 
time? By counting the number of times we saw N -grams for the first time in 
our training corpus. This is very simple to produce since the count of ‘first¬ 
time’ iV-grams is just the number of N-gram types we saw in the data (since 
we had to see each type for the first time exactly once). 

So we estimate the total probability mass of all the zero Af-grams with 
the number of types divided by the number of tokens plus observed types: 


I pi 


T 

N + T 


(6.16) 


Why do we normalize by the number of tokens plus types? We can 
think of our training corpus as a series of events; one event for each token 
and one event for each new type. So Equation 6.16 gives the Maximum 
Likelihood Estimate of the probability of a new type event occurring. Note 
that the number of observed types T is different than the ‘total types’ or 
‘vocabulary size V 9 that we used in add-one smoothing: T is the types we 
have already seen, while V is the total number of possible types we might 
ever see. 

Equation 6.16 gives the total ‘probability of unseen ALgrams’. We 
need to divide this up among all the zero N-grams. We could just choose 
to divide it equally. Let Z be the total number of N-grams with count zero 
(types; there aren’t any tokens). Each formerly-zero unigram now gets its 
equal share of the redistributed probability mass: z 


z= E 1 

z:q=0 

T 

Pi ~ Z(N+T) 


(6.17) 

(6.18) 


If the total probability of zero ALgrains is computed from Equation 6.16, 
the extra probability mass must come from somewhere; we get it by dis- 
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Z(W X 


counting the probability of all the seen A-grams as follows: 


Pi 


Cj 

N + T 


if(Q>0) 


(6.19) 


Alternatively, we can represent the smoothed counts directly as: 


cj = 


IJV_ if--o 

Z N+T' 11 u 
Q yy_j_ y 7 if Q ^ 0 


( 6 . 20 ) 


Witten-Bell discounting looks a lot like add-one smoothing for uni¬ 
grams. But if we extend the equation to bigrams we will see a big difference. 
This is because now our type-counts are conditioned on some history. In or¬ 
der to compute the probability of a bigram w n -\W n -2 we haven’t seen, we 
use The probability of seeing a new bigram starting with w n -\ ’. This lets our 
estimate of ‘first-time bigrams’ be specific to a word history. Words that tend 
to occur in a smaller number of bigrams will supply a lower ‘unseen-bigram’ 
estimate than words that are more promiscuous. 

We represent this fact by conditioning T, the number of bigram types, 
and A, the number of bigram tokens, on the previous word w x , as follows: 


E P*(Wi\w x ) 

i:c(w x wi )=0 


T(w x ) 

N(w x ) + T(w x ) 


( 6 . 21 ) 


Again, we will need to distribute this probability mass among all the 
unseen bigrams. Let Z again be the total number of bigrams with a given first 
word that have count zero (types; there aren’t any tokens). Each formerly- 
zero bigram now gets its equal share of the redistributed probability mass: 

Z{w x ) = £ 1 (6.22) 

z:c(w A w/)=0 


P*(Wi\Wi-i) 


T’Cwi-i) 

z(w ! -_i)(yv+rK_ 1 )) 


if (c 


Wj-lWi 


= 0) 


(6.23) 


As for the non-zero bigrams, we discount them in the same manner, by 
parameterizing T on the history: 


L P*( W i\ W *) 

i:c(w x W})>0 


c(w x Wj) 
c(w x ) + T(w x ) 


(6.24) 


To use Equation 6.24 to smooth the restaurant bigram from Figure 6.5, 
we will need the number of bigram types T (w) for each of the first words. 
Here are those values: 
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I 

95 

want 

76 

to 

130 

eat 

124 

Chinese 

20 

food 

82 

lunch 

45 


In addition we will need the Z values for each of these words. Since 
we know how many words we have in the vocabulary (V - 1,616), there are 
exactly V possible bigrams that begin with a given word w, so the number of 
unseen bigram types with a given prefix is V minus the number of observed 
types: 

Z(w) = V-T(w) (6.25) 

Here are those Z values: 

I 1,521 

want 1,540 

to 1,486 

eat 1,492 

Chinese 1,596 

food 1,534 

lunch 1,571 

Figure 6.9 shows the discounted restaurant bigram counts. 
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17 

2 

46 

Chinese 

2 

.012 

.012 

.012 

.012 

109 

1 

food 

18 

.059 

16 

.059 

.059 

.059 

.059 

lunch 

4 

.026 

.026 

.026 

.026 

1 

.026 


Figure 6.9 Witten-Bell smoothed bigram counts for 7 of the words (out of 

1616 total word types) in the Berkeley Restaurant Project corpus of 

~10,000 

sentences. 








The discount values for the Witten-Bell algorithm are much more rea¬ 
sonable than for add-one smoothing: 
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I .97 

want .94 

to .96 

eat .88 

Chinese .91 

food .94 

lunch .91 

It is also possible to use Witten-Bell (or other) discounting in a differ¬ 
ent way. In Equation (6.21), we conditioned the smoothed bigram proba¬ 
bilities on the previous word. That is, we conditioned the number of types 
T{w x ) and tokens A(w x ) on the previous word w x . But we could choose 
instead to treat a bigram as if it were a single event, ignoring the fact that 
it is composed of two words. Then T would be the number of types of all 
bigrams, and A would be the number of tokens of all bigrams that occurred. 
Treating the bigrams as a unit in this way, we are essentially discounting, not 
probability the conditional probability P{wj\w x ), but the joint probability P(w x W[). In 
this way the probability P(w x Wi) is treated just like a unigram probability. 
This kind of discounting is less commonly used than the ‘conditional’ dis¬ 
counting we walked through above starting with equation 6.21. (Although it 
is often used for the Good-Turing discounting algorithm described below). 

In Section 6.4 we show that discounting also plays a role in more so¬ 
phisticated language models. Witten-Bell discounting is commonly used in 
speech recognition systems such as Placeway et al (1993). 

Good-T\iring Discounting 

This section introduces a slightly more complex form of discounting than the 
Turing Witten-Bell algorithm called Good-Hiring smoothing. This section may be 
skipped by readers who are not focusing on discounting algorithms. 

The Good-Turing algorithm was first described by Good (1953), who 
credits Turing with the original idea; a complete proof is presented in Church 
et al (1991). The basic insight of Good-Turing smoothing is to re-estimate 
the amount of probability mass to assign to A-grams with zero or low counts 
by looking at the number of A-grams with higher counts. In other words, 
we examine N c , the number of A-grams that occur c times. We refer to the 
number of A-grams that occur c times as the frequency of frequency c. So 
applying the idea to smoothing the joint probability of bigrams, Nq is the 
number of bigrams b of count 0, Ai the number of bigrams with count 1, and 
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so on: 

N c = £ 1 (6.26) 

b:c{b)=c 

The Good-Turing estimate gives a smoothed count c* based on the set 
of N c for all c, as follows: 

c* = ( c +l)^ (6.27) 

For example, the revised count for the bigrams that never occurred (co) 
is estimating by dividing the number of bigrams that occurred once (the sin¬ 
gleton or ‘hapax legomenon’ bigrams N\) by the number of bigrams that singleton 
never occurred (No)- Using the count of things we’ve seen once to estimate 
the count of things we’ve never seen should remind you of the Witten-Bell 
discounting algorithm we saw earlier in this chapter. The Good-Turing al¬ 
gorithm was first applied to the smoothing of A-gram grammars by Katz, 
as cited in Nadas (1984). Figure 6.10 gives an example of the applica¬ 
tion of Good-Turing discounting to a bigram grammar computed by Church 
and Gale (1991) from 22 million words from the Associated Press (AP) 
newswire. The first column shows the count c, i.e. the number of observed 
instances of a bigram. The second column shows the number of bigrams that 
had this count. Thus 449,721 bigrams has a count of 2. The third column 
shows c*, the Good-Turing re-estimation of the count. 


c (MLE) 

N c 

c* (GT) 

0 

74,671,100,000 

0.0000270 

1 

2,018,046 

0.446 

2 

449,721 

1.26 

3 

188,933 

2.24 

4 

105,668 

3.24 

5 

68,379 

4.22 

6 

48,190 

5.19 

7 

35,709 

6.21 

8 

27,710 

7.24 

9 

22,280 

8.25 

Figure 6.10 Bigram ‘frequencies of frequencies’ from 22 million AP bi¬ 
grams, and Good-Turing re-estimations after Church and Gale (1991) 


Church et al (1991) show that the Good-Turing estimate relies on the 
assumption that the distribution of each bigram is binomial. The estimate 
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also assumes we know No, the number of bigrams we haven’t seen. We 
know this because given a vocabulary size of V, the total number of bigrams 
is V 2 . (No is V 2 minus all the bigrams we have seen). 

In practice, this discounted estimate c* is not used for all counts c. 
Large counts (where c>k for some threshold k ) are assumed to be reliable. 
Katz (1987) suggests setting k at 5. Thus we define 

c* =c for c> k (6.28) 


The correct equation for c* when some k is introduced (from Katz 
(1987)) is: 

(>+ p N c+i _ c f£±l M+i 

+ forl<c<*. 


i _ (k+l)N k+l 
N, 


(6.29) 


With Good-Turing discounting as with any other, it is usual to treat 
N-grams with low counts (especially counts of 1) as if the count was 0. 


6.4 Backoff 


The discounting we have been discussing so far can help solve the problem of 
zero frequency n-grams. But there is an additional source of knowledge we 
can draw on. If we have no examples of a particular trigram w n - 2 W n -\W n to 
help us compute P(w n \w n -\w n - 2 ), we can estimate its probability by using 
the bigram probability P(w n \w n -\). Similarly, if we don’t have counts to 
compute P(w„|w M _i), we can look to the unigram P(w n ). 

There are two ways to rely on this iV-gram ‘hierarchy’, deleted inter- 
terpolatIon polation and backoff. We will focus on backoff, although we give a quick 
backoff overview of deleted interpolation after this section. Backoff TV-gram model¬ 
ing is a nonlinear method introduced by Katz (1987). In the backoff model, 
like the deleted interpolation model, we build an TV-gram model based on an 
(N-l)-gmm model. The difference is that in backoff, if we have non-zero 
trigram counts, we rely solely on the trigram counts and don’t interpolate 
the bigram and unigram counts at all. We only ‘back off’ to a lower-order 
TV-gram if we have zero evidence for a higher-order 7/-gram. 

The trigram version of backoff might be represented as follows: 

P(wi\wi- 2 Wi-0, if C(wi- 2 Wi-iWi) > 0 

(XiP(W/|h’;_i), if C(W;_ 2 W/_iW I ) = 0 

(o.3U) 

mdC(wi-iWi) > 0 
QL 2 P(wi ), otherwise. 
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Let’s ignore the a values for a moment; we’ll discuss the need for 
these weighting factors below. Here’s a first pass at the (recursive) equation 
for representing the general case of this form of backoff. 


P{Wn\w n n -N+\) = ^KK-N+i) 

+0(p( Wb |<:L 1 )) ( ^( w »I<=Jv+ 2 ) ( 6 - 31 > 

Again, ignore the a and the P for the moment. Following Katz, we’ve 
used 0 to indicate the binary function that selects a lower-ordered model only 
if the higher-order model gives a zero probability: 


f 1, ifx = 0 
\ 0, otherwise. 


(6.32) 


and each P(-) is a MLE (i.e. computed directly by dividing counts). The 
next section will work through these equations in more detail. In order to do 
that, we’ll need to understand the role of the a values and how to compute 
them. 


Combining Backoff with Discounting 

Our previous discussions of discounting showed how to use a discounting 
algorithm to assign probability mass to unseen events. For simplicity, we 
assumed that these unseen events were all equally probable, and so the prob¬ 
ability mass got distributed evenly among all unseen events. Now we can 
combine discounting with the backoff algorithm we have just seen to be a 
little more clever in assigning probability to unseen events. We will use the 
discounting algorithm to tells us how much total probability mass to set aside 
for all the events we haven’t seen, and the backoff algorithm to tell us how 
to distribute this probability in a clever way. 

First, the reader should stop and answer the following question (don’t 
look ahead): Why did we need the a values in Equation 6.30 (or Equa¬ 
tion 6.31)? Why couldn’t we just have three sets of probabilities without 
weights? 

The answer: without a values, the result of the equation would not be 
a true probability! This is because the original F(w m |w”I^ + 1 ) we got from 
relative frequencies were true probabilities, i.e. if we sum the probability of 
a given w n over all ALgram contexts, we should get 1: 

Y^P(w n \WiWj) = 1 
‘J 


(6.33) 
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But if that is the case, if we back off to a lower order model when the 
probability is zero, we are adding extra probability mass into the equation, 
and the total probability of a word will be greater than 1! 

Thus any backoff language model must also be discounted. This ex- 
p plains the as and P in Equation 6.31. The P comes from our need to discount 

the MLE probabilities to save some probability mass for the lower-order N- 
grams. We will use P to mean discounted probabilities, and save P for plain 
old relative frequencies computed directly from counts. The a is used to en¬ 
sure that the probability mass from all the lower order N-grams sums up to 
exactly the amount that we saved by discounting the higher-order TV-grams. 
Here’s the correct final equation: 


P{Wn\K-N+l) 


P{Wn\K-N+\) 

+e(p(w„K:^ +1 )) 

• oc(w”_^ +1 )P(w n |h , ”_^ +2 ) 


(6.34) 


Now let’s see the formal definition of each of these components of the 
equation. We define P as the discounted (c*) MLE estimate of the conditional 
probability of an N-gram, as follows: 


P(w„ 


IV 


,n—l 
'n-N +1 


) = 


c *«-n+ l) 
c{w\~ N+l ) 


(6.35) 


This probability P will be slightly less than the MLE estimate 

C \™n-N+l) 

(i.e. on average the c* will be less than c). This will leave some probability 
mass for the lower order N-grams. Now we need to build the a weighting 
we’ll need for passing this mass to the lower-order N-grams. Let’s represent 
the total amount of left-over probability mass by the function p, a function of 
the N — 1-gram context. For a given N - 1-gram context, the total left-over 
probability mass can be computed by subtracting from 1 the total discounted 
probability mass for all N-grams starting with that context: 


PM-V+l) — ^ X P( W n\ W n-N+l) (6.36) 

w«:c(wj_ w+1 )>0 


This gives us the total probability mass that we are ready to distribute 
to all N - 1-gram (e.g. bigrams if our original model was a trigram). Each 
individual N — 1-gram (bigram) will only get a fraction of this mass, so we 
need to normalize (3 by the total probability of all the N— 1-grams (bigrams) 
that begin some N-gram (trigram). The final equation for computing how 
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much probability mass to distribute from an N-gram to an N — 1-gram is 
represented by the function a: 


/v+i) 


1 )>0 P{}Vn | ^ w _/v_|_i) 

1 — Lw li :c(H/;_ /v+1 )>0^( H? n| M; w -A^+2) 


(6.37) 


Note that a is a function of the preceding word string, i.e. of w* 
thus the amount by which we discount each trigram (d\ and the mass that 
gets reassigned to lower-order N-grams (a) are recomputed for every N- 
gram (more accurately for every N — 1-gram that occurs in any Af-gram). 

We only need to specify what to do when the counts of an TV — 1-gram 
context are 0, (i.e. when c(vv™_J^ +1 ) = 0) and our definition is complete: 

P(w„KZ N N X\) = PKICw) (6-38) 


and 


P(w n |<=J r+1 )=0 (6.39) 

and 

P«=JUi) = 1 (6-40) 

In Equation 6.35, the discounted probability P can be computed with 
the discounted counts c* from the Witten-Bell discounting (Equation 6.20) 
or with the Good-Turing discounting discussed below. 

Here is the backoff model expressed in a slightly clearer format in its 
trigram version: 


( P(Wi\Wi- 2 Wi-l), 


if C(wi-2Wi-\Wi) > 0 


P(Wi\Wi- 2 Wi-l) 


a{w n n _\)P{wi\wi-i), 


[ a(w„_i)P(wj), 


if C(wi- 2 Wi-iWi) = 0 
andC(wf_iH 7 /) > 0 
otherwise. 


In practice, when discounting, we usually ignore counts of 1, i.e. we 
treat /V-grams with a count of 1 as if they never occurred. 

Gupta et al (1992) present a variant backoff method of assigning prob¬ 
abilities to zero trigrams. 


6.5 Deleted Interpolation 


The deleted interpolation algorithm, due to Jelinek and Mercer (1980), com¬ 
bines different 7V-gram orders by linearly interpolating all three models when¬ 
ever we are computing any trigram. That is, we estimate the probability 
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P{wn\w n -iW n - 2 ) by mixing together the unigram, bigram, and trigram prob¬ 
abilities. Each of these is weighted by a linear weight X: 


P{w n \w n -\W n -2) = 'k\P{w n \w n -iW n -2 ) 

+A.2/ 3 (w ra |w w _i) 

+taP(w„) 

such that the A,s sum to 1: 

I> = 1 


(6.41) 


(6.42) 


DELETED IN¬ 
TERPOLATION 


In practice, in this deleted interpolation algorithm we don’t train just 
three Xs for a trigram grammar. Instead, we make each X a function of the 
context. This way if we have particularly accurate counts for a particular 
bigram, we assume that the counts of the trigrams based on this bigram will 
be more trustworthy, and so we can make the lambdas for those trigrams 
higher and thus give that trigram more weight in the interpolation. So a 
more detailed version of the interpolation formula would be: 


P{w n \w n -2Wn-\) = M<- 2 ) P ( w «K- 2 W ra -l) 

+M<=2)^KK-l) 

+ U{<l\)P{w n ) (6.43) 

Given the P(w.„) values, the X values are trained so as to maximize the 
likelihood of a held-out corpus separate from the main training corpus, using 
a version of the EM algorithm defined in Chapter 7 (Baum, 1972; Dempster 
et al , 1977; Jelinek and Mercer, 1980). Further details of the algorithm are 
described in Bahl et al (1983). 


6.6 V-GRAMS FOR SPELLING AND PRONUNCIATION 


In Chapter 5 we saw the use of the Bayesian/noisy-channel algorithm for 
correcting spelling errors and for picking a word given a surface pronunci¬ 
ation. We saw that both these algorithms failed, returning the wrong word, 
because they had no way to model the probability of multiple-word strings. 
Now that our n-g rams give us such a model, we return to these two problems. 
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Context-Sensitive Spelling Error Correction 

Chapter 5 introduced the idea of detecting spelling errors by looking for 
words that are not in a dictionary, are not generated by some finite-state 
model of English word-formation, or have low probability orthotactics. But 
none of these techniques is sufficient to detect and correct real-word spelling 
errors, real-word error detection. This is the class of errors that result 
in an actual word of English. This can happen from typographical errors 
(insertion, deletion, transposition) that accidently produce a real word (e.g. 
there for three), or because the writer substituted the wrong spelling of a 
homophone or near-homophone (e.g. dessert for desert , or piece for peace). 
The task of correcting these errors is called context-sensitive spelling error 
correction. 

How important are these errors? By an a priori analysis of single typo¬ 
graphical errors (single insertions, deletions, substitutions, or transpositions) 
Peterson (1986) estimates that 15% of such spelling errors produce valid En¬ 
glish words (given a very large list of 350,000 words). Kukich (1992) sum¬ 
marizes a number of other analyses based on empirical studies of corpora, 
which give figures between of 25% and 40% for the percentage of errors 
that are valid English words. Figure 6.11 gives some examples from Kukich 
(1992), broken down into local and global errors. Local errors are those that 
are probably detectable from the immediate surrounding words, while global 
errors are ones in which error detection requires examination of a large con¬ 
text. 

One method for context-sensitive spelling error correction is based on 
A-grams. 

The word A-gram approach to spelling error detection and correction 
was proposed by Mays et al. (1991). The idea is to generate every possible 
misspelling of each word in a sentence either just by typographical modifica¬ 
tions (letter insertion, deletion, substitution), or by including homophones as 
well, (and presumably including the correct spelling), and then choosing the 
spelling that gives the sentence the highest prior probability. That is, given 
a sentence W = {w i, w> 2 , —, w*, —, w n }, where w* has alternative spelling 
w f k , w", etc, we choose the spelling among these possible spellings that max¬ 
imizes P{W), using the A-gram grammar to compute P(TT). A class-based 
A-gram can be used instead, which can find unlikely part-of-speech combi¬ 
nations, although it may not do as well at to finding unlikely word combina¬ 
tions. 

There are many other statistical approaches to context-sensitive spelling 


REAL-WORD 

ERROR 

DETECTION 
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Local Errors 

The study was conducted mainly be John Black. 

They are leaving in about fifteen minuets to go to her house. 

The design an construction of the system will take more than a year. 
Hopefully, all with continue smoothly in my absence. 

Can they lave him my messages? 

I need to notified the bank of [this problem.] 

He need to go there right no w. 

He is trying to fine out. 

Global Errors 

Won’t they heave if next Monday at that time? 

This thesis is supported by the fact that since 1989 the system 
has been operating system with all four units on-line, but... 

Figure 6.11 Some attested real-word spelling errors from Kukich (1992), 
broken down into local and global errors. 


error correction, some proposed directly for spelling, other for more general 
types of lexical disambiguation (such as word-sense disambiguation or ac¬ 
cent restoration). Beside the trigram approach we have just described, these 
include Bayesian classifiers, alone or combined with trigrams (Gale et al , 
1993; Golding, 1997; Golding and Schabes, 1996), decision lists (Yarowsky, 
1994), transformation based learning (Mangu and Brill, 1997), latent se¬ 
mantic analysis (Jones and Martin, 1997), and Winnow (Golding and Roth, 
1999). In a comparison of these, Golding and Roth (1999) found the Win¬ 
now algorithm gave the best performance. In general, however, these algo¬ 
rithms are very similar in many ways; they are all based on features like 
word and part-of-speech Af-grams, and Roth (1998, 1999) shows that many 
of them make their predictions using a family of linear predictors called Lin¬ 
ear Statistical Queries (LSQ) hypotheses. Chapter 17 will define all these 
algorithms and discuss these issues further in the context of word-sense dis¬ 
ambiguation. 

N- grams for Pronunciation Modeling 

The N-g ram model can also be used to get better performance on the words- 
from-pronunciation task that we studied in Chapter 5. Recall that the input 
was the pronunciation [n iy] following the word /. We said that the five words 
that could be pronounced [n iy] were need , new , neat , the , and knee. The 
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algorithm in Chapter 5 was based on the product of the unigram probability 
of each word and the pronunciation likelihood, and incorrectly chose the 
word new , based mainly on its high unigram probability. 

Adding a simple bigram probability, even without proper smoothing, is 
enough to solve this problem correctly. In the following table we fix the table 
on page 165 by using a bigram rather than unigram word probability p(w ) 
for each of the five candidate words (given that the word / occurs 64,736 
times in the combined Brown and Switchboard corpora): 

Word C(Tw) C(Tw)+0.5 p(w|T) 


need 

153 

153.5 .0016 

new 

0 

0.5 .000005 

knee 

0 

0.5 .000005 

the 

17 

17.5 .00018 

neat 

0 

0.5 .000005 


Incorporating this new word probability into combined model, it now 
predicts the correct word need , as the table below shows: 


Word p(y w) p(w) 

p(y|w)p(w) 

need 

.ii 

.0016 

.00018 

knee 

1.00 

.000005 

.000005 

neat 

.52 

.000005 

.0000026 

new 

.36 

.000005 

.0000018 

the 

0 

.00018 

0 


6.7 Entropy 

I got the horse right here 

Frank Loesser, Guys and Dolls 

Entropy and perplexity are the most common metrics used to evaluate 
N-gmm systems. The next sections summarize a few necessary fundamental 
facts about information theory and then introduce the entropy and perplex¬ 
ity metrics. We strongly suggest that the interested reader consult a good 
information theory textbook; Cover and Thomas (1991) is one excellent ex¬ 
ample. 

Entropy is a measure of information, and is invaluable in natural lan- entropy 
guage processing, speech recognition, and computational linguistics. It can 
be used as a metric for how much information there is in a particular gram- 
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mar, for how well a given grammar matches a given language, for how pre¬ 
dictive a given TV-gram grammar is about what the next word could be. Given 
two grammars and a corpus, we can use entropy to tell us which grammar 
better matches the corpus. We can also use entropy to compare how diffi¬ 
cult two speech recognition tasks are, and also to measure how well a given 
probabilistic grammar matches human grammars. 

Computing entropy requires that we establish a random variable X that 
ranges over whatever we are predicting (words, letters, parts of speech, the 
set of which we’ll call %), and that has a particular probability function, call 
it p(x). The entropy of this random variable X is then 

H ( x ) = - T p( x ) lo §2 p(x) (6-44) 

The log can in principle be computed in any base; recall that we use log 
base 2 in all calculations in this book. The result of this is that the entropy is 
measured in bits. 

The most intuitive way to define entropy for computer scientists is to 
think of the entropy as a lower bound on the number of bits it would take 
to encode a certain decision or piece of information in the optimal coding 
scheme. 

Cover and Thomas (1991) suggest the following example. Imagine 
that we want to place a bet on a horse race but it is too far to go all the way 
to Yonkers Racetrack, and we’d like to send a short message to the bookie 
to tell him which horse to bet on. Suppose there are eight horses in this 
particular race. 

One way to encode this message is just to use the binary representation 
of the horse’s number as the code; thus horse 1 would be 0 01, horse 2 010, 
horse 3 011, and so on, with horse 8 coded as 00 0. If we spend the whole 
day betting, and each horse is coded with 3 bits, on the average we would be 
sending 3 bits per race. 

Can we do better? Suppose that the spread is the actual distribution of 
the bets placed, and that we represent it as the prior probability of each horse 
as follows: 


Horse 1 \ 

Horse 5 ^ 

Horse 2 i 

Horse 6 ^ 

Horse 3 | 

Horse 7 ^ 

Horse 4 ^ 

Horse 8 ^ 


The entropy of the random variable X that ranges over horses gives us 
a lower bound on the number of bits, and is: 
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H{X) = -J>(«)log|>(») 

i= 1 

= - flog k- 5 log ilog 1.- ± log ±-4 (^ log ^) 

= 2 bits (6.45) 

A code that averages 2 bits per race can be built by using short en¬ 
codings for more probable horses, and longer encodings for less probable 
horses. For example, we could encode the most likely horse with the code 
0, and the remaining horses as 10, then 110, 1110, 111100, 111101, 
111110,and 111111. 

What if the horses are equally likely? We saw above that if we use an 
equal-length binary code for the horse numbers, each horse took 3 bits to 
code, and so the average was 3. Is the entropy the same? In this case each 
horse would have a probability of The entropy of the choice of horses is 
then: 

H ( x ) = - L ^ lo § ^ = - lo § ^ = 3 bits ( 6 - 46 > 

The value 2 H is called the perplexity (Jelinek et al , 1977; Bahl et al , 
1983). Perplexity can be intuitively thought of as the weighted average num¬ 
ber of choices a random variable has to make. Thus choosing between 8 
equally likely horses (where H = 3 bits), the perplexity is 2 3 or 8. Choosing 
between the biased horses in the table above (where H = 2 bits), the perplex¬ 
ity is 2 2 or 4. 

Until now we have been computing the entropy of a single variable. 
But most of what we will use entropy for involves sequences ; for a grammar, 
for example, we will be computing the entropy of some sequence of words 
W = {. .. wo, w\ , w 2,..., w n }. One way to do this is to have a variable that 
ranges over sequences of words. For example we can compute the entropy 
of a random variable that ranges over all finite sequences of words of length 
b in some language L as follows: 

H( Wl ,w 2 ,...,Wn) = - Z pWOlogp^) (6.47) 

w[ l eL 

We could define the entropy rate (we could also think of this as the 
per-word entropy) as the entropy of this sequence divided by the number 
of words: 

-H (Wf) = -- Z P( W T) !°g/>(WT) (6-48) 

n n w[ l eL 


PERPLEXITY 


ENTROPY 

RATE 
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STATIONARY 


CROSS 

ENTROPY 


But to measure the true entropy of a language, we need to consider 
sequences of infinite length. If we think of a language as a stochastic process 
L that produces a sequence of words, its entropy rate H(L) is defined as: 


H(L) = lim -H(w\,wz,...,w n ) 

f l 

= lim - Y p(wi,...,w„)logp(wi,...,w„) (6.49) 

n ^ n wt L 


The Shannon-McMillan-Breiman theorem (Algoet and Cover, 1988; 
Cover and Thomas, 1991) states that if the language is regular in certain 
ways (to be exact, if it is both stationary and ergodic), 


H{L)= lim — -logp(wiW 2 ...w n ) 

n^oo fi 


(6.50) 


That is, we can take a single sequence that is long enough instead 
of summing over all possible sequences. The intuition of the Shannon- 
McMillan-Breiman theorem is that a long enough sequence of words will 
contain in it many other shorter sequences, and that each of these shorter se¬ 
quences will reoccur in the longer sequence according to their probabilities. 

A stochastic process is said to be stationary if the probabilities it as¬ 
signs to a sequence are invariant with respect to shifts in the time index. In 
other words, the probability distribution for words at time t is the same as the 
probability distribution at time t + 1. Markov models, and hence A-grams, 
are stationary. For example, in a bigram, P; is dependent only on P/_ \. So if 
we shift our time index by x, Pi+ X is still dependent on P z+A _ \. But natural 
language is not stationary, since as we will see in Chapter 9, the probability 
of upcoming words can be dependent on events that were arbitrarily distant 
and time dependent. Thus our statistical models only give an approximation 
to the correct distributions and entropies of natural language. 

To summarize, by making some incorrect but convenient simplifying 
assumptions, we can compute the entropy of some stochastic process by tak¬ 
ing a very long sample of the output, and computing its average log probabil¬ 
ity. In the next section we talk about the why and how; why we would want to 
do this (i.e. for what kinds of problems would the entropy tell us something 
useful), and how to compute the probability of a very long sequence. 


Cross Entropy for Comparing Models 

In this section we introduce the cross entropy, and discuss its usefulness in 
comparing different probabilistic models. The cross entropy is useful when 
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we don’t know the actual probability distribution p that generated some data. 
It allows us to use some m, which is a model of p (i.e. an approximation to 
p. The cross-entropy of m on p is defined by: 

H(p,m) = lim - Y /?(wi,...,w„)logm(wi,...,w„) (6.51) 

n ^°° n wt L 

That is we draw sequences according to the probability distribution /?, 
but sum the log of their probability according to m. 

Again, following the Shannon-McMillan-Breiman theorem, for a sta¬ 
tionary ergodic process: 

H(p,m) lim —-logm(w>iW 2 ---w«) (6.52) 

n—>oo ft 

What makes the cross entropy useful is that the cross entropy H(p,m) 
is an upper bound on the entropy H(p). For any model m: 

H(p)<H(p,m) (6.53) 

This means that we can use some simplified model m to help estimate 
the true entropy of a sequence of symbols drawn according to probability 
p. The more accurate m is, the closer the cross entropy H(p,m) will be to 
the true entropy H(p). Thus the difference between H(p,m) and H(p) is 
a measure of how accurate a model is. Between two models m\ and m 2 , 
the more accurate model will be the one with the lower cross-entropy. (The 
cross-entropy can never be lower than the true entropy, so a model cannot 
err by underestimating the true entropy). 

The Entropy of English 

As we suggested in the previous section, the cross-entropy of some model 
m can be used as an upper bound on the true entropy of some process. We 
can use this method to get an estimate of the true entropy of English. Why 
should we care about the entropy of English? 

One reason is that the true entropy of English would give us a solid 
lower bound for all of our future experiments on probabilistic grammars. 
Another is that we can use the entropy values for English to help under¬ 
stand what parts of a language provide the most information (for example, 
is the predictability of English mainly based on word order, on semantics, 
on morphology, on constituency, or on pragmatic cues?) This can help us 
immensely in knowing where to focus our language-modeling efforts. 

There are two common methods for computing the entropy of English. 
The first was employed by Shannon (1951), as part of his groundbreaking 
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Methodology Box: Perplexity 


The methodology box on page 202 mentioned the idea of com¬ 
puting the perplexity of a test set as a way of comparing two prob¬ 
abilistic models. (Despite the risk of ambiguity, we will follow the 
speech and language processing literature in using the term ‘perplex¬ 
ity’ rather than the more technically correct term ‘cross-perplexity’.) 
Here’s an example of perplexity computation as part of a ‘business 
news dictation system’. We trained uni gram, bigram, and trigram 
Katz-style backoff grammars with Good-Turing discounting on 38 
million words (including start-of-sentence tokens) from the Wall 
Street Journal (from the WSJO corpus (LDC, 1993)). Wc used a 
vocabulary of 19,979 words (i.e. the rest of the words types were 
mapped to the unknown word token <UNK> in both training and 
testing). We then computed the perplexity of each of these models 
on a test set of 1.5 million words (where the perplexity is defined as 
The table below shows the perplexity of a 1.5 million word 
WSJ test set according to each of these grammars. 

iV-gram order Perplexity 
Unigram 962 

Bigram 170 

Trigram 109 

In computing perplexities the model m must be constructed 
without any knowledge of the test set t. Any kind of knowledge 
of the test set can cause the perplexity to be artificially low. For 
example, sometimes instead of mapping all unknown words to the 
<UNK> token, we use a closed-vocabulary test set in which we 
know in advance what the set of words is. This can greatly reduce 
the perplexity. As long as this knowledge Is provided equally to each 
of the models we are comparing, the closed-vocabulary perplexity is 
still a useful metric for comparing models. But this cross-perplexity 
is no longer guaranteed to be greater than the true perplexity of the 
test set, and so great care must be taken in interpreting the results. In 
general, the perplexity of two language models is only comparable 
i f they use the same vocabulary. 
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work in defining the field of information theory. His idea was to use human 
subjects, and to construct a psychological experiment that requires them to 
guess strings of letters; by looking at how many guesses it takes them to 
guess letters correctly we can estimate the probability of the letters, and 
hence the entropy of the sequence. 

The actual experiment is designed as follows: we present a subject 
with some English text and ask the subject to guess the next letter. The 
subjects will use their knowledge of the language to guess the most proba¬ 
ble letter first, the next most probable next, etc. We record the number of 
guesses it takes for the subject to guess correctly. Shannon’s insight was that 
the entropy of the number-of-guesses sequence is the same as the entropy 
of English. (The intuition is that given the number-of-guesses sequence, we 
could reconstruct the original text by choosing the “nth most probable” letter 
whenever the subject took n guesses). This methodology requires the use of 
letter guesses rather than word guesses (since the subject sometimes has to 
do an exhaustive search of all the possible letters!), and so Shannon com¬ 
puted the per-letter entropy of English rather than the per-word entropy. 
He reported an entropy of 1.3 bits (for 27 characters (26 letters plus space)). 
Shannon’s estimate is likely to be too low, since it is based on a single text 
(Jefferson the Virginian by Dumas Malone). Shannon notes that his subjects 
had worse guesses (hence higher entropies) on other texts (newspaper writ¬ 
ing, scientific work, and poetry). More recently variations on the Shannon 
experiments include the use of a gambling paradigm where the subjects get 
to bet on the next letter (Cover and King, 1978; Cover and Thomas, 1991). 

The second method for computing the entropy of English helps avoid 
the single-text problem that confounds Shannon’s results. This method is to 
take a very good stochastic model, train it on a very large corpus, and use 
it to assign a log-probability to a very long sequence of English, using the 
Shannon-McMillan-Breiman theorem: 

H (English) < lim - - \ogm(w\W 2 • • • w n ) (6.54) 

n—fi 

For example, Brown et al (1992) trained a trigram language model 
on 583 million words of English, (293,181 different types) and used it to 
compute the probability of the entire Brown corpus (1,014,312 tokens). The 
training data include newspapers, encyclopedias, novels, office correspon¬ 
dence, proceedings of the Canadian parliament, and other miscellaneous 
sources. 

They then computed the character-entropy of the Brown corpus, by us¬ 
ing their word-trigram grammar to assign probabilities to the Brown corpus, 
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considered as a sequence of individual letters. They obtained an entropy 
of 1.75 bits per character (where the set of characters included all the 95 
printable ASCII characters). 

The average length of English written words (including space) has been 
reported at 5.5 letters (Nadas, 1984). If this is correct, it means that the Shan¬ 
non estimate of 1.3 bits per letter corresponds to a per-word perplexity of 142 
for general English. The numbers we report above for the WSJ experiments 
are significantly lower since the training and test set came from same sub¬ 
sample of English. That is, those experiments underestimate the complexity 
of English since the Wall Street Journal looks very little like Shakespeare. 


Bibliographical and Historical Notes 

The underlying mathematics of the A-gram was first proposed by Markov 
(1913), who used what are now called simple Markov chains or bigrams 
to model sequences of 20,000 vowels and consonants in Pushkin’s Eugene 
Onegin. Markov classified each letter as V or C and computed the prob¬ 
ability of occurrence of sequences such as YYV, YCY, CYC, etc. Shan¬ 
non (1948) applied A-grams to compute approximations to English word 
sequences. Based on Shannon’s work, Markov models were commonly used 
in modeling word sequences by the 1950’s. In a series of extremely influ¬ 
ential papers starting with Chomsky (1956) and including Chomsky (1957) 
and Miller and Chomsky (1963), Noam Chomsky argued that ‘finite-state 
Markov processes’, while a possibly useful engineering heuristic, were in¬ 
capable of being a complete cognitive model of human grammatical knowl¬ 
edge. These arguments led many linguists and computational linguists away 
from statistical models altogether. 

The resurgence of A-gram models came from Jelinek, Mercer, Bahl, 
and colleagues at the IBM Thomas J. Watson Research Center, influenced 
by Shannon, and Baker at CMU, influenced by the work of Baum and col¬ 
leagues. These two labs independently successfully used A-grams in their 
speech recognition systems (Jelinek, 1976; Baker, 1975; Bahl et al., 1983). 
The Good-Turing algorithm was first applied to the smoothing of A-gram 
grammars at IBM by Katz, as cited in Nadas (1984). Jelinek (1990) summa¬ 
rizes this and many other early language model innovations used in the IBM 
language models. 

While smoothing had been applied as an engineering solution to the 
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zero-frequency problem at least as early as Jeffreys (1948) (add-one smooth¬ 
ing), it is only relatively recently that smoothing received serious atten¬ 
tion. Church and Gale (1991) gives a good description of the Good-Turing 
method, as well as the proof, and also gives a good description of the Deleted 
Interpolation method and a new smoothing method. Sampson (1996) also 
has a useful discussion of Good-Turing. Problems with the Add-one algo¬ 
rithm are summarized in Gale and Church (1994). Method C in Witten and 
Bell (1991) describes what we called Witten-Bell discounting. Chen and 
Goodman (1996) give an empirical comparison of different smoothing algo¬ 
rithms, including two new methods, average-count and one-count , as well as 
Church and Gale’s. Iyer and Ostendorf (1997) discuss a way of smoothing 
by adding in data from additional corpora. 

Much recent work on language modeling has focused on ways to build 
more sophisticated A-grams. These approaches include giving extra weight 
to A-grams which have already occurred recently (the cache LM of Kuhn cache lm 
and de Mori (1990)), choosing long-distance triggers instead of just local triggers 
A-grams (Rosenfeld, 1996; Niesler and Woodland, 1999; Zhou and Lua, 

1998), and using variable-length A-grams (Ney et al , 1994; Kneser, 1996; length^ 
Niesler and Woodland, 1996). Another class of approaches use semantic in¬ 
formation to enrich the A-gram, including semantic word associations based 
on the latent semantic indexing described in Chapter 15 (Coccaro and Ju- sem^ttic 
rafsky, 1998; Bellegarda, 1999)), and from on-line dictionaries or thesauri 
(Demetriou et al , 1997). Class-based A-grams, based on word classes such class-based 
as parts-of-speech, are described in Chapter 8. Language models based on 
more structured linguistic knowledge (such as probabilistic parsers) are de¬ 
scribed in Chapter 12. Finally, a number of augmentations to A-grams are 
based on discourse knowledge, such as using knowledge of the current topic 
(Chen etal , 1998; Seymore and Rosenfeld, 1997; Seymore et al, 1998; Flo- 
rian and Yarowsky, 1999; Khudanpur and Wu, 1999) or the current speech 
act in dialog (see Chapter 19). 


6.8 Summary 

This chapter introduced the A-gram, one of the oldest and most broadly use¬ 
ful practical tools in language processing. 

• An A-gram probability is the conditional probability of a word given 
the previous A — 1 words. A-gram probabilities can be computed by 
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simply counting in a corpus and normalizing (the Maximum Likeli¬ 
hood Estimate) or they can be computed by more sophisticated algo¬ 
rithms. The advantage of /V-grams is that they take advantage of lots 
of rich lexical knowledge. A disadvantage for some purposes is that 
they are very dependent on the corpus they were trained on. 

• Smoothing algorithms provide a better way of estimating the proba¬ 
bility of TV-grams which never occur. Commonly-used smoothing al¬ 
gorithms include backoff or deleted interpolation, with Witten-Bell 
or Good-Turing discounting. 

• Corpus-based language models like iV-grams are evaluated by sepa¬ 
rating the corpus into a training set and a test set, training the model 
on the training set, and evaluating on the test set. The entropy H, or 
more commonly the perplexity 2 H (more properly cross-entropy and 
cross-perplexity) of a test set are used to compare language models. 


Exercises 

6.1 Write out the equation for trigram probability estimation (modifying 
Equation 6.11) 

6.2 Write out the equation for the discount d = ^ for add-one smoothing. 
Do the same for Witten-Bell smoothing. How do they differ? 

6.3 Write a program (Perl is sufficient) to compute unsmoothed unigrams 
and bigrams. 

6.4 Run your Af-gram program on two different small corpora of your 
choice (you might use email text or newsgroups). Now compare the statistics 
of the two corpora. What are the differences in the most common unigrams 
between the two? How about interesting differences in bigrams? 

6.5 Add an option to your program to generate random sentences. 

6.6 Add an option to your program to do Witten-Bell discounting. 

6.7 Add an option to your program to compute the entropy (or perplexity) 
of a test set. 
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6.8 Suppose someone took all the words in a sentence and reordered them 
randomly. Write a program which take as input such a bag of words and ^ a 0 g r°s 
produces as output a guess at the original order. Use the Yiterbi algorithm 

and an Y-gram grammar produced by your Y-gram program (on some cor- 
pus). 

6.9 The field of authorship attribution is concerned with discovering the attribution 
author of a particular text. Authorship attribution is important in many fields, 
including history, literature, and forensic linguistics. For example Mosteller 

and Wallace (1964) applied authorship identification techniques to discover 
who wrote The Federalist papers. The Federalist papers were written in 
1787-1788 by Alexander Hamilton, John Jay and James Madison to per¬ 
suade New York to ratify the United States Constitution. They were pub¬ 
lished anonymously, and as a result, although some of the 85 essays were 
clearly attributable to one author or another, the authorship of 12 were in 
dispute between Hamilton and Madison. Foster (1989) applied authorship 
identification techniques to suggest that W.S.’s Funeral Elegy for William 
Peter was probably written by William Shakespeare, and that the anonymous 
author of Primary Colors the roman a clef about the Clinton campaign for 
the American presidency, was journalist Joe Klein (Foster, 1996). 

A standard technique for authorship attribution, first used by Mosteller 
and Wallace, is a Bayesian approach. For example, they trained a proba¬ 
bilistic model of the writing of Hamilton, and another model of the writings 
of Madison, and computed the maximum-likelihood author for each of the 
disputed essays. There are many complex factors that go into these models, 
including vocabulary use, word-length, syllable structure, rhyme, grammar; 
see (Holmes, 1994) for a summary. This approach can also be used for iden¬ 
tifying which genre a a text comes from. 

One factor in many models is the use of rare words. As a simple ap¬ 
proximation to this one factor, apply the Bayesian method to the attribution 
of any particular text. You will need 3 things: (1) a text to test, (2) two po¬ 
tential authors or genres, with a large on-line text sample of each. One of 
them should be the correct author. Train a unigram language model on each 
of the candidate authors. You are only going to use the singleton unigrams 
in each language model. You will compute P{T\A\), the probability of the 
text given author or genre Ai, by (1) taking the language model from Ai, 

(2) by multiplying together the the probabilities of all the unigrams that only 
occur once in the ‘unknown’ text and (3) taking the geometric mean of these 
(i.e. the nth root, where n is the number of probabilities you multiplied). 
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Do the same for A 2 . Choose whichever is higher. Did it produce the correct 
candidate? 




HMMS AND SPEECH 
RECOGNITION 


When Frederic was a little lad he proved so brave and daring, 

His father thought he’d ’prentice him to some career seafaring. 

I was, alas! his nurs’rymaid, and so it fell to my lot 
To take and bind the promising boy apprentice to a pilot - 
A life not bad for a hardy lad, though surely not a high lot, 

Though I’m a nurse, you might do worse than make your boy a pilot. 
I was a stupid nurs’rymaid, on breakers always steering, 

And I did not catch the word aright, through being hard of hearing; 
Mistaking my instructions, which within my brain did gyrate, 

I took and bound this promising boy apprentice to a pirate. 

The Pirates of Penzance, Gilbert and Sullivan, 1877 


Alas, this mistake by nurserymaid Ruth led to Frederic’s long indenture as a 
pirate and, due to a slight complication involving twenty-first birthdays and 
leap years, nearly led to 63 extra years of apprenticeship. The mistake was 
quite natural, in a Gilbert-and-Sullivan sort of way; as Ruth later noted, “The 
two words were so much alike!”. True, true; spoken language understanding 
is a difficult task, and it is remarkable that humans do as well at it as we do. 
The goal of automatic speech recognition (ASR) research is to address this 
problem computationally by building systems which map from an acoustic 
signal to a string of words. Automatic speech understanding (ASU) extends 
this goal to producing some sort of understanding of the sentence, rather than 
just the words. 

The general problem of automatic transcription of speech by any speaker 
in any environment is still far from solved. But recent years have seen ASR 
technology mature to the point where it is viable in certain limited domains. 
One major application area is in human-computer interaction. While many 
tasks are better solved with visual or pointing interfaces, speech has the po- 
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tential to be a better interface than the keyboard for tasks where full natural 
language communication is useful, or for which keyboards are not appro¬ 
priate. This includes hands-busy or eyes-busy applications, such as where 
the user has objects to manipulate or equipment to control. Another impor¬ 
tant application area is telephony, where speech recognition is already used 
for example for entering digits, recognizing ”yes” to accept collect calls, or 
call-routing (“Accounting, please”, “Prof. Landauer, please”). Finally, ASR 
is being applied to dictation, i.e. transcription of extended monologue by 
a single specific speaker. Dictation is common in fields such as law and is 
also important as part of augmentative communication (interaction between 
computers and humans with some disability resulting in the inability to type, 
or the inability to speak). The blind Milton famously dictated Paradise Lost 
to his daughters, and Henry James dictated his later novels after a repetitive 
stress injury. 

Different applications of speech technology necessarily place different 
constraints on the problem and lead to different algorithms. We chose to fo¬ 
cus this chapter on the fundamentals of one crucial area: Large-Vocabulary 
lvcsr Continuous Speech Recognition (LVCSR), with a small section on acous¬ 
tic issues in speech synthesis. Large-vocabulary generally means that the 
systems have a vocabulary of roughly 5,000 to 60,000 words. The term con- 
continuous tinuous means that the words are run together naturally; it contrasts with 
w8rd ted ' isolated-word speech recognition, in which each word must be preceded 
and followed by a pause. Furthermore, the algorithms we will discuss are 
^dependent generally speaker-independent; that is, they are able to recognize speech 
from people whose speech the system has never been exposed to before. 

The chapter begins with an overview of speech recognition architec¬ 
ture, and then proceeds to introduce the HMM, the use of the Viterbi and 
A* algorithms for decoding, speech acoustics and features, and the use of 
Gaussians and MLPs to compute acoustic probabilities. Even relying on the 
previous three chapters, summarizing this much of the field in this chapter 
requires us to omit many crucial areas; the reader is encouraged to see the 
suggested readings at the end of the chapter for useful textbooks and articles. 
This chapter also includes a short section on the acoustic component of the 
speech synthesis algorithms discussed in Chapter 4. 
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7.1 Speech Recognition Architecture 


Previous chapters have introduced many of the core algorithms used in speech 
recognition. Chapter 4 introduced the notions of phone and syllable. Chap¬ 
ter 5 introduced the noisy channel model, the use of the Bayes rule, and 
the probabilistic automaton. Chapter 6 introduced the A-gram language 
model and the perplexity metric. In this chapter we introduce the remaining 
components of a modem speech recognizer: the Hidden Markov Model 
(HMM), the idea of spectral features, the forward-backward algorithm 
for HMM training, and the Viterbi and stack decoding (also called A* de¬ 
coding algorithms for solving the decoding problem: mapping from strings ^coding 
of phone probability vectors to strings of words. 

Let’s begin by revisiting the noisy channel model that we saw in Chap¬ 
ter 5. Speech recognition systems treat the acoustic input as if it were a 
‘noisy’ version of the source sentence. In order to ‘decode’ this noisy sen¬ 
tence, we consider all possible sentences, and for each one we compute 
the probability of it generating the noisy sentence. We then chose the sen¬ 
tence with the maximum probability. Figure 7.1 shows this noisy-channel 
metaphor. 


source 

sentence 



noisy 

sentence 



guess at 
decoder . original 

? Alice was beginning 1o gel\^ sentence 

?Every happy family... N. 

?ln a hole in the ground... Ifmusicbethe 

?lf music be fhe food of love... / food of love... 
?lf music be fhe foot of dove../ 


Figure 7.1 The noisy channel model applied to entire sentences (Figure 5.1 
showed its application to individual words). Modem speech recognizers work 
by searching through a huge space of potential ‘source’ sentences and choos¬ 
ing the one which has the highest probability of generating the ‘noisy’ sen¬ 
tence. To do this they must have models that express the probability of 
sentences being realized as certain strings of words (W-grains), models that 
express the probability of words being realized as certain strings of phones 
(HMMs) and models that express the probability of phones being realized as 
acoustic or spectral features (Gaussians/MLPs). 


Implementing the noisy-channel model as we have expressed it in Fig¬ 
ure 7.1 requires solutions to two problems. First, in order to pick the sentence 
that best matches the noisy input we will need a complete metric for a “best 
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match”. Because speech is so variable, an acoustic input sentence will never 
exactly match any model we have for this sentence. As we have suggested 
in previous chapters, we will use probability as our metric, and will show 
how to combine the various probabilistic estimators to get a complete esti¬ 
mate for the probability of a noisy observation-sequence given a candidate 
sentence. Second, since the set of all English sentences is huge, we need 
an efficient algorithm that will not search through all possible sentences, but 
only ones that have a good chance of matching the input. This is the decod¬ 
ing or search problem, and we will summarize two approaches: the Viterbi 
or dynamic programming decoder, and the stack or A* decoder. 

In the rest of this introduction we will introduce the probabilistic or 
Bayesian model for speech recognition (or more accurately re-introduce it, 
since we first used the model in our discussions of spelling and pronunciation 
in Chapter 5); we leave discussion of decoding/search for pages 242-249. 

The goal of the probabilistic noisy channel architecture for speech 
recognition can be summarized as follows: 

“ What is the most likely sentence out of all sentences in the lan¬ 
guage L given some acoustic input O?” 

We can treat the acoustic input O as a sequence of individual ‘symbols’ 
or ‘observations’ (for example by slicing up the input every 10 milliseconds, 
and representing each slice by floating-point values of the energy or fre¬ 
quencies of that slice). Each index then represents some time interval, and 
successive Oi indicate temporally consecutive slices of the input (note that 
capital letters will stand for sequences of symbols and lower-case letters for 
individual symbols): 

O = 0i,02,0 3 ,...,0f C 7 * 1 ) 

Similarly, we will treat a sentence as if it were composed simply of a 
string of words: 

W = wi,w 2 ,w 3 ,...,w tt (7.2) 

Both of these are simplifying assumptions; for example dividing sen¬ 
tences into words is sometimes too fine a division (we’d like to model facts 
about groups of words rather than individual words) and sometimes too gross 
a division (we’d like to talk about morphology). Usually in speech recogni¬ 
tion a word is defined by orthography (after mapping every word to lower¬ 
case): oak is treated as a different word than oaks , but the auxiliary can (“can 
you tell me... ”) is treated as the same word as the noun can (“i need a can 
of... ” ). Recent ASR research has begun to focus on building more so- 




Section 7.1. Speech Recognition Architecture 


237 


phisticated models of ASR words incorporating the morphological insights 
of Chapter 3 and the part-of-speech information that we will study in Chap¬ 
ter 8. 

The probabilistic implementation of our intuition above, then, can be 
expressed as follows: 

W = argmax/>(W|0) (7.3) 

WeL 


Recall that the function argmax x /(x) means The x such that f(x) is 
largest’. Equation (7.3) is guaranteed to give us the optimal sentence W; we 
now need to make the equation operational. That is, for a given sentence W 
and acoustic sequence O we need to compute P{W\0). Recall that given any 
probability P(x|y), we can use Bayes’ rule to break it down as follows: 


P{Ay)= r_m 


(7.4) 


We saw in Chapter 5 that we can substitute (7.4) into (7.3) as follows: 


W = argmax 
WeL 


P(0\W)P{W) 

W) 


(7.5) 


The probabilities on the right hand of (7.5) are for the most part easier 
to compute than P{W\0). For example, P{W ), the prior probability of the 
word string itself is exactly what is estimated by the n -gram language mod¬ 
els of Chapter 6. And we will see below that P{0\W) turns out to be easy 
to estimate as well. But P(0 ), the probability of the acoustic observation 
sequence, turns out to be harder to estimate. Luckily, we can ignore P(O) 
just as we saw in Chapter 5. Why? Since we are maximizing over all pos¬ 
sible sentences, we will be computing f or cac h sentence in the 

language. But P{0) doesn’t change for each sentence! For each potential 
sentence we are still examining the same observations O , which must have 
the same probability P(0). Thus: 

W = argmax = 2Xgmd ^p[o\W)P{W) (7.6) 

WeL P\0) WeL 


To summarize, the most probable sentence W given some observation 
sequence O can be computing by taking the product of two probabilities for 
each sentence, and choosing the sentence for which this product is greatest. 
These two terms have names; P(W ), the prior probability, is called the lan¬ 
guage model. P(0\W), the observation likelihood, is called the acoustic 
model. 


LANGUAGE 

MODEL 

ACOUSTIC 

MODEL 
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likelihood prior 

Key Concept #5. W = argmax P(0\W) P{W) (7.7) 

weL 

We have already seen in Chapter 6 how to compute the language model 
prior P(W) by using A-gram grammars. The rest of this chapter will show 
how to compute the acoustic model P(0\W), in two steps. First we will 
make the simplifying assumption that the input sequence is a sequence of 
phones F rather than a sequence of acoustic observations. Recall that we 
introduced the forward algorithm in Chapter 5, which was given ‘obser¬ 
vations’ that were strings of phones, and produced the probability of these 
phone observations given a single word. We will show that these probabilis¬ 
tic phone automata are really a special case of the Hidden Markov Model, 
and we will show how to extend these models to give the probability of a 
phone sequence given an entire sentence. 

One problem with the forward algorithm as we presented it was that 
in order to know which word was the most-likely word (the ‘decoding prob¬ 
lem’), we had to run the forward algorithm again for each word. This is 
clearly intractable for sentences; we can’t possibly run the forward algo¬ 
rithm separately for each possible sentence of English. We will thus intro¬ 
duce two different algorithms which simultaneously compute the likelihood 
of an observation sequence given each sentence, and give us the most-likely 
sentence. These are the Viterbi and the A* decoding algorithms. 

Once we have solved the likelihood-computation and decoding prob¬ 
lems for a simplified input consisting of strings of phones, we will show 
how the same algorithms can be applied to true acoustic input rather than 
pre-defined phones. This will involve a quick introduction to acoustic input 
and feature extraction, the process of deriving meaningful features from 
the input soundwave. Then we will introduce the two standard models for 
computing phone-probabilities from these features: Gaussian models, and 
neural net (multi-layer perceptrons) models. 

Finally, we will introduce the standard algorithm for training the Hid¬ 
den Markov Models and the phone-probability estimators, the forward- 
backward or Baum-Welch algorithm) (Baum, 1972), a special case of the 
the Expectation-Maximization or EM algorithm (Dempster et a/., 1977). 

As a preview of the chapter, Figure 7.2 shows an outline of the compo¬ 
nents of a speech recognition system. The figure shows a speech recognition 
system broken down into three stages. In the signal processing or feature 
extraction stage, the acoustic waveform is sliced up into frames (usually 
of 10, 15, or 20 milliseconds) which are transformed into spectral features 
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which give information about how much energy in the signal is at different 
frequencies* In the subword or phone recognition stage, we use statistical 
techniques like neural networks or Gaussian models to tentatively recognize 
individual speech sounds like p or b. For a neural network, the output of 
this stage is a vector of probabilities over phones for each frame (i.c. ‘for 
this frame the probability of [p] is ,8, the probability of [b] is *1, the proba¬ 
bility of I f] is .02, etc'); for a Gaussian model the probabilities are slightly 
different. Finally, in the decoding stage, we take a dictionary of word pro¬ 
nunciations and a language model (probabilistic grammar) and use a Vitcrbi 
or decoder to find the sequence of words which has the highest proba¬ 
bility given the acoustic events. 
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Figure 7.2 Schematic architecture for a (simplified) speech recognizer 


7.2 Overview of Hidden Markov Models 

In Chapter 5 we used weighted finite-state automata or Markov chains to 
model the pronunciation of words. The automata consisted of a sequence 
of states q — {qi)q\qi .. each corresponding to a phone, and a set of 
transition probabilities between states, encoding the probability 

of one phone following another. We represented the states as nodes, and 
the transition probabilities as edges between nodes; an edge existed between 
two nodes if there was a non-zero transition probability between the two 
nodes* We also saw that we could use the forward algorithm to compute the 
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likelihood of a sequence of observed phones o = ( 0 i<? 2<?3 • ■■£*)• Figure 7.3 
shows an automaton for the word need with sample observation sequence of 
the kind we saw in Chapter 5. 


a 24 = .11 

Word Model 

Observation 
Sequence 
(phone symbols) 


Figure 7.3 A simple weighted automaton or Markov chain pronunciation 
network for the word need , showing the transition probabilities, and a sample 
observation sequence. The transition probabilities a xy between two states x 
and y are 1.0 unless otherwise specified. 



Oi 0 2 0 3 


While we will see that these models figure importantly in speech recog¬ 
nition, they simplify the problem in two ways. First, they assume that the 
input consists of a sequence of symbols! Obviously this is not true in the 
real world, where speech input consists essentially of small movements of 
air particles. In speech recognition, the input is an ambiguous, real-valued 
representation of the sliced-up input signal, called features or spectral fea¬ 
tures. We will study the details of some of these features beginning on 
page 258; acoustic features represent such information as how much energy 
there is at different frequencies. The second simplifying assumption of the 
weighted automata of Chapter 5 was that the input symbols correspond ex¬ 
actly to the states of the machine. Thus when seeing an input symbol [b], 
we knew that we could move into a state labeled [b]. In a Hidden Markov 
markov Model, by contrast, we can’t look at the input symbols and know which state 
to move to. The input symbols don’t uniquely determine the next state. 1 

Recall that a weighted automaton or simple Markov model is specified 
by the set of states Q , the set of transition probabilities A, a defined start 
state and end state(s), and a set of observation likelihoods B. For weighted 

1 Actually, as we mentioned in passing, by this second criterion some of the automata we 
saw in Chapter 5 were technically HMMs as well. This is because the first symbol in the 
input string [n iy] was compatible with the [n] states in the words need or an. Seeing the 
symbols [n], we didn’t know which underlying state it was generated by, need-n or an-n. 
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automata, we defined the probabilities hi{o t ) as 1.0 if the state / matched the 
observation Oi and 0 if they didn't match. An HMM formally differs from a 
Markov model by adding two more requirements. First, it has a separate set 
of observation symbols O, which is not drawn from the same alphabet as the 
state set Q. Second, the observation likelihood function B is not limited to 
the values 1.0 andO; in an HMM the probability bi(o t ) can take on any value 
from 0 to l .0. 


^24 


Word Model 


Observation 
Sequence 
(spectral feature 
vectors) 
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Figure 7,4 An HMM pronunciation network for the word need, showing 
the transition probabilities, and a sample observation sequence. Note the ad¬ 
dition of the ouLput probabilities B, HMMs used in speech recognition usually 
use self-loops on the states to model variable phone durations. 


Figure 7.4 shows an HMM for the word need and a sample observa¬ 
tion sequence. Note the differences from Figure 7.3. First, the observation 
sequences arc now vectors of spectral features representing the speech sig¬ 
nal. Next, note that we've also allowed one state to generate multiple copies 
of the same observation, by having a loop on the state. This loops allows 
HMMs to model the variable duration of phones; longer phones require more 
loops through Ihe HMM. 

In summary, here are the parameters we need to define an HMM: 

* states; A set of states Q — q\q 2 * *. 

* transition probabilities; A set of probabilities A — oqi am *. * a n \... 
Each a t j represents the probability of transitioning from state i to state 
j. The set of these is the transition probability matrix*;, 

* observation likelihoods: A set of observation likelihoods B = bi(o t ), 
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each expressing the probability of an observation o t being generated 
from a state i. 

In our examples so far we have used two ‘special’ states (non-emitting 
states) as the start and end state; as we saw in Chapter 5 it is also possible to 
avoid the use of these states by specifying two more things: 

• initial distribution: An initial probability distribution over states, n, 
such that %i is the probability that the HMM will start in state i. Of 
course some states j may have %j = 0, meaning that they cannot be 
initial states. 

• accepting states: A set of legal accepting states. 

As was true for the weighted automata, the sequences of symbols that 
are input to the model (if we are thinking of it as recognizer) or which are 
produced by the model (if we are thinking of it as a generator) are generally 
called the observation sequence, referred to as O = ( 01 O 20 3 ... or)- 

13 The Viterbi Algorithm Revisited 

Chapter 5 showed how the forward algorithm could be used to compute the 
probability of an observation sequence given an automaton, and how the 
Viterbi algorithm can be used to find the most-likely path through the au¬ 
tomaton, as well as the probability of the observation sequence given this 
most-likely path. In Chapter 5 the observation sequences consisted of a 
single word. But in continuous speech, the input consists of sequences of 
words, and we are not given the location of the word boundaries. Knowing 
where the word boundaries are massively simplifies the problem of pronun¬ 
ciation; in Chapter 5 since we were sure that the pronunciation [ni] came 
from one word, we only had 7 candidates to compare. But in actual speech 
we don’t know where the word boundaries are. For example, try to decode 
the following sentence from Switchboard (don’t peek ahead!): 

[ay d ih s hh er d s ah m th ih ng ax b aw m uh v ih ng r ih s en 1 ih] 

The answer is in the footnote . 2 The task is hard partly because of coar¬ 
ticulation and fast speech (e.g. [d] for the first phone of justl). But mainly 
it’s the lack of spaces indicating word boundaries that make the task difficult. 
The task of finding word boundaries in connected speech is called segmen¬ 
tation and we will solve it by using the Viterbi algorithm just as we did for 

2 I just heard something about moving recently. 
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Chinese word-segmentation in Chapter 5; Recall that the algorithm for Chi¬ 
nese word-segmentation relied on choosing the segmentation that resulted 
in the sequence of words with the highest frequency. For speech segmenta¬ 
tion we use the more sophisticated Y-gram language models introduced in 
Chapter 6. In the rest of this section we show how the Viterbi algorithm can 
be applied to the task of decoding and segmentation of a simple string of 
observations phones, using an n-gmm language model. We will show how 
the algorithm is used to segment a very simple string of words. Here’s the 
input and output we will work with: 

Input Output 

[aa n iy dh ax] I need the 

Figure 7.5 shows word models for /, need , the , and also, just to make 
things difficult, the word on. 



Word model for "on" 



Word model for "need" 


aa 

Word model for "I" 


Figure 7.5 Pronunciation networks for the words /, on , need , and the. All 
networks (especially the ) are significantly simplified. 


Recall that the goal of the Viterbi algorithm is to find the best state se¬ 
quence q = (qiq 2 Q 3 ---qt) given the set of observed phones o = {o\ O 2 O 3 ...o t ). 
A graphic illustration of the output of the dynamic programming algorithm is 
shown in Figure 7.6. Along the y-axis are all the words in the lexicon; inside 
each word are its states. The x-axis is ordered by time, with one observed 
phone per time unit. 3 Each cell in the matrix will contain the probability of 

3 This x-axis component of the model is simplified in two major ways that we will show 
how to fix in the next section. First, the observations will not be phones but extracted spectral 
features, and second, each phone consists of not time unit observation but many observations 
(since phones can last for more than one phone). The y-axis is also simplified in this example, 
since as we will see most ASR system use multiple ‘subphone’ units for each phone. 
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the most-likely sequence ending at that state. We can find the most-likely 
state sequence for the entire observation string by looking at the cell in the 
right-most column that has the highest-probability, and tracing back the se¬ 
quence that produced it. 



Figure 7.6 An illustration of the results of the Yiterbi algorithm used to 
find the most-likely phone sequence (and hence estimate the most-likely word 
sequence). 


DYNAMIC 

PROGRAM¬ 

MING 

INVARIANT 


More formally, we are searching for the best state sequence q* = {q\qi ... qr), 
given an observation sequence o = (o\ 02 ... 0 t) and a model (a weighted au¬ 
tomaton or ‘state graph’) X. Each cell viterbtfi, t ]of the matrix contains the 
probability of the best path which accounts for the first t observations and 
ends in state i of the HMM. This is the most-probable path out of all possible 
sequences of states of length t— 1: 

viterbi[t , i] max P{q\qi ... q t ~\ , q t = i, 0 \ , 02 • • • o t |A,) (7.8) 

In order to compute viterbi[t,i 7, the Viterbi algorithm assumes the dy¬ 
namic programming invariant. This is the simplifying (but incorrect) as¬ 
sumption that if the ultimate best path for the entire observation sequence 
happens to go through a state qu that this best path must include the best 
path up to and including state qi. This doesn’t mean that the best path at any 
time t is the best path for the whole sequence. A path can look bad at the 
beginning but turn out to be the best path. As we will see later, the Viterbi 
assumption breaks down for certain kinds of grammars (including trigram 
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grammars) and so some recognizers have moved to another kind of decoder, 
the stack or A* decoder; more on that later. As we saw in our discussion 
of the minimum-edit-distance algorithm in Chapter 5, the reason for making 
the Viterbi assumption is that it allows us to break down the computation of 
the optimal path probability in a simple way; each of the best paths at time t 
is the best extension of each of the paths ending at time t— 1. In other words, 
the recurrence relation for the best path at time t ending in state j, viterbi[t,j], 
is the maximum of the possible extensions of every possible previous path 
from time t— 1 to time t: 

viterbi[tj\ ma x(viterbi[t — 1 ,i\aij)bj(o t ) (7.9) 

i 

The algorithm as we describe it in Figure 7.9 takes a sequence of ob¬ 
servations, and a single probabilistic automaton, and returns the optimal path 
through the automaton. Since the algorithm requires a single automaton, we 
will need to combine the different probabilistic phone networks for the , /, 
need , and a into one automaton. In order to build this new automaton we 
will need to add arcs with probabilities between any two words: bigram 
probabilities. Figure 7.7 shows simple bigram probabilities computed from 
the combined Brown and Switchboard corpus. 
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Figure 7.7 

Bigram probabilities for the words the. 

on, need , and I following 

each other, and starting a sentence (i.e. 

following #). Computed from the 

combined Brown and Switchboard corpora with add-0.5 smoothing. 


Figure 7.8 shows the combined pronunciation networks for the 4 words 
together with a few of the new arcs with the bigram probabilities. For read¬ 
ability of the diagram, most of the arcs aren’t shown; the reader should imag¬ 
ine that each probability in Figure 7.7 is inserted as an arc between every two 
words. 

The algorithm is given in Figure 5.19 in Chapter 5, and is repeated 
here for convenience as Figure 7.9. We see in Figure 7.9 that the Yiterbi 
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algorithm sets up a probability matrix, with one column for each time index 
t and one row for each state in the state graph. The algorithm first creates 
T + 2 columns; Figure 7.9 shows the first 6 columns. The first column is 
an initial pseudo-observation, the next corresponds to the first observation 
phone [aa], and so on. We begin in the first column by setting the probability 
of the start state to 1.0, and the other probabilities to 0; the reader should 
find this in Figure 7.10. Cells with probability 0 are simply left blank for 
readability. For each column of the matrix, i.e. for each time index t , each 
cell viterbi[t y j 7, will contain the probability of the most likely path to end in 
that cell. We will calculate this probability recursively, by maximizing over 
the probability of coming from all possible preceding states. Then we move 
to the next state; for each of the i states viterbi[0,i] in column 0, we compute 
the probability of moving into each of the j states viterbi[l,j] in column 1, 
according to the recurrence relation in (7.9). In the column for the input aa , 
only two cells have non-zero entries, since b\ (aa) is zero for every other 
state except the two states labeled aa. The value of viterbi(l,aa) of the word 
/ is the product of the transition probability from # to / and the probability of 
/ being pronounced with the vowel aa. 

Notice that if we look at the column for the observation //, that the word 
on is currently the ‘most-probable’ word. But since there is no word or set of 
words in this lexicon which is pronounced i dh ax , the path starting with on 
is a dead end, i.e. this hypothesis can never be extended to cover the whole 
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function Yiterbi {observations of len T,state-graph) returns best-path 

num-states num-of-s tates ( state-graph) 

Create a path probability matrix viterbi[num-states+2,T+2] 

viterbi[0,0] d- 1.0 

for each time step t from 0 to 7 do 

for each state s from 0 to num-states do 

for each transition s f from s specified by state-graph 
new-scored-viterbi[s 7 1] * a[s,s r ] * b s >(o t ) 
if ((viterbi[s f ,t+l] = 0) || (new-score > viterbi[s f , /+7])) 
then 

viterbi[s f , t+l\d- new-score 
back-pointer^, tdl]d-s 

Backtrace from highest probability state in the final column of viterbi[] and 
return path 


Figure 7.9 Yiterbi algorithm for finding optimal sequence of states in con¬ 
tinuous speech recognition, simplified by using phones as inputs (duplicate of 
Figure 5.19). Given an observation sequence of phones and a weighted au¬ 
tomaton (state graph), the algorithm returns the path through the automaton 
which has minimum probability and accepts the observation sequence. 
is the transition probability from current state s to next state s’ and b s >(o t ) is 
the observation likelihood of s’ given o t . 


utterance. 

By the time we see the observation iy , there are two competing paths: 
I need and I the; I need is currently more likely. When we get to the obser¬ 
vation dh , we could have arrived from either the iy of need or the iy of the. 
The probability of the max of these two paths, in this case the path through / 
need , will go into the cell for dh. 

Finally, the probability for the best path will appear in the final ax 
column. In this example, only one cell is non-zero in this column; the ax 
state of the word the (a real example wouldn’t be this simple; many other 
cells would be non-zero). 

If the sentence had actually ended here, we would now need to back¬ 
trace to find the path that gave us this probability. We can’t just pick the 
highest probability state for each state column. Why not? Because the most 
likely path early on is not necessarily the most likely path for the whole sen¬ 
tence. Recall that the most likely path after seeing n was the word on. But 
the most likely path for the whole sentence is I need the. Thus we had to 
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# aa n iy dh ax 


Figure 7.10 The entries in the individual state columns for the Yiterbi al¬ 
gorithm. Each cell keeps the probability of the best path so far and a pointer 
to the previous cell along that path. Backtracing from the successful last word 
(the), we can reconstruct the word sequence I need the . 


rely in Figure 7.10 on the ‘Hansel and GreteT method (or the ‘Jason and 
the Minotaur’ method if you like your metaphors more classical): whenever 
we moved into a cell, we kept pointers back to the cell we came from. The 
reader should convince themselves that the Viterbi algorithm has simultane¬ 
ously solved the segmentation and decoding problems. 

The presentation of the Viterbi algorithm in this section has been sim¬ 
plified; actual implementations of Yiterbi decoding are more complex in 
three key ways that we have mentioned already. First, in an actual HMM 
for speech recognition, the input would not be phones. Instead, the input 
is a feature vector of spectral and acoustic features. Thus the observation 
likelihood probabilities bi(t) of an observation o t given a state i will not 
simply take on the values 0 or 1, but will be more fine-grained probability 
estimates, computed via mixtures of Gaussian probability estimators or neu¬ 
ral nets. The next section will show how these probabilities are computed. 

Second, the HMM states in most speech recognition systems are not 
simple phones but rather subphones. In these systems each phone is divided 
into 3 states: the beginning, middle and final portions of the phone. Dividing 
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up a phone in this way captures the intuition that the significant changes in 
the acoustic input happen at a finer granularity than the phone; for exam¬ 
ple the closure and release of a stop consonant. Furthermore, many systems 
use a separate instance of each of these subphones for each triphone con- triphone 
text (Schwartz et al , 1985; Deng et al , 1990). Thus instead of around 60 
phone units, there could be as many as 60 3 context-dependent triphones. In 
practice, many possible sequences of phones never occur or are very rare, 
so systems create a much smaller number of triphones models by clustering 
the possible triphones (Young and Woodland, 1994). Figure 7.11 shows an 
example of the complete phone model for the triphone b(ax,aw). 



Figure 7.11 An example of the context-dependent triphone b(ax,aw) (the 
phone [b] preceded by a [ax] and followed by a [aw], as in the beginning of 
about , showing its left, middle, and right subphones. 


Finally, in practice in large-vocabulary recognition it is too expensive 
to consider all possible words when the algorithm is extending paths from 
one state-column to the next. Instead, low-probability paths are pruned at 
each time step and not extended to the next state column. This is usually im¬ 
plemented via beam search: for each state column (time step), the algorithm p^ CH 
maintains a short list of high-probability words whose path probabilities are 
within some percentage (beam width) of the most probable word path. Only beam width 
transitions from these words are extended when moving to the next time step. 

Since the words are ranked by the probability of the path so far, which words 
are within the beam (active) will change from time step to time step. Making 
this beam search approximation allows a significant speed-up at the cost of 
a degradation to the decoding performance. This beam search strategy was 
first implemented by Lowerre (1968). Because in practice most implemen¬ 
tations of Viterbi use beam search, some of the literature uses the term beam 
search or time-synchronous beam search instead of Viterbi. 
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7.4 Advanced Methods for Decoding 


There are two main limitations of the Yiterbi decoder. First, the Viterbi 
decoder does not actually compute the sequence of words which is most 
probable given the input acoustics. Instead, it computes an approximation to 
this: the sequence of states (i.e. phones or subphones ) which is most proba¬ 
ble given the input. This difference may not always be important; the most 
probable sequence of phones may very well correspond exactly to the most 
probable sequence of words. But sometimes the most probable sequence 
of phones does not correspond to the most probable word sequence. For 
example consider a speech recognition system whose lexicon has multiple 
pronunciations for each word. Suppose the correct word sequence includes 
a word with very many pronunciations. Since the probabilities leaving the 
start arc of each word must sum to 1.0, each of these pronunciation-paths 
through this multiple-pronunciation HMM word model will have a smaller 
probability than the path through a word with only a single pronunciation 
path. Thus because the Viterbi decoder can only follow one of these pronun¬ 
ciation paths, it may ignore this word in favor of an incorrect word with only 
one pronunciation path. 

A second problem with the Viterbi decoder is that it cannot be used 
with all possible language models. In fact, the Viterbi algorithm as we have 
defined it cannot take complete advantage of any language model more com¬ 
plex than a bigram grammar. This is because of the fact mentioned early that 
a trigram grammar, for example, violates the dynamic programming in¬ 
variant that makes dynamic programming algorithms possible. Recall that 
this invariant is the simplifying (but incorrect) assumption that if the ultimate 
best path for the entire observation sequence happens to go through a state 
qu that this best path must include the best path up to and including state 
qi. Since a trigram grammar allows the probability of a word to be based on 
the two previous words, it is possible that the best trigram-probability path 
for the sentence may go through a word but not include the best path to that 
word. Such a situation could occur if a particular word w x has a high tri¬ 
gram probability given w y .w z , but that conversely the best path to w y didn’t 
include w z (i.e. P{w y \w q ^w z ) was low for all q). 

There are two classes of solutions to these problems with Viterbi de¬ 
coding. One class involves modifying the Viterbi decoder to return mul¬ 
tiple potential utterances and then using other high-level language model 
or pronunciation-modeling algorithms to re-rank these multiple outputs. In 
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general this kind of multiple-pass decoding allows a computationally effi¬ 
cient, but perhaps unsophisticated, language model like a bigram to perform 
a rough first decoding pass, allowing more sophisticated but slower decoding 
algorithms to run on a reduced search space. 

For example, Schwartz and Chow (1990) give a Viterbi-like algorithm 
which returns the A-best sentences (word sequences) for a given speech in- A-best 
put. Suppose for example a bigram grammar is used with this A-best-Yiterbi 
to return the 10,000 most highly-probable sentences, each with their likeli¬ 
hood score. A trigram-grammar can then be used to assign a new language- 
model prior probability to each of these sentences. These priors can be com¬ 
bined with the acoustic likelihood of each sentence to generate a posterior 
probability for each sentence. Sentences can then be rescored using this rescored 
more sophisticated probability.Figure 7.12 shows an intuition for this algo¬ 
rithm. 
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Source Source 
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Figure 7.12 The use of A-best decoding as part of a two-stage decoding 
model. Efficient but unsophisticated knowledge sources are used to return the 
A-best utterances. This significantly reduces the search space for the second 
pass models, which are thus free to be very sophisticated but slow. 


An augmentation of A-best, still part of this first class of extensions to 
Yiterbi, is to return, not a list of sentences, but a word lattice. A word lattice Entice 
is a directed graph of words and links between them which can compactly 
encode a large number of possible sentences. Each word in the lattice is aug¬ 
mented with its observation likelihood, so that any particular path through 
the lattice can then be combined with the prior probability derived from a 
more sophisticated language model. For example Murveit et al (1993) de¬ 
scribe an algorithm used in the SRI recognizer Decipher which uses a bigram 
grammar in a rough first pass, producing a word lattice which is then refined 
by a more sophisticated language model. 

The second solution to the problems with Yiterbi decoding is to employ 
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a completely different decoding algorithm. The most common alternative 
decoder algorithm is the stack decoder, also called the A* decoder (Jelinek, 1969; 
a* Jelinek et al , 1975). We will describe the algorithm in terms of the A* 

a* search search used in the artificial intelligence literature, although the development 

of stack decoding actually came from the communications theory literature 
and the link with AI best-first search was noticed only later (Jelinek, 1976). 

A* Decoding 

To see how the A* decoding method works, we need to revisit the Yiterbi al¬ 
gorithm. Recall that the Yiterbi algorithm computed an approximation of the 
forward algorithm. Yiterbi computes the observation likelihood of the single 
best (MAX) path through the HMM, while the forward algorithm computes 
the observation likelihood of the total (SUM) of all the paths through the 
HMM. But we accepted this approximation because Viterbi computed this 
likelihood and searched for the optimal path simultaneously. The A* decod¬ 
ing algorithm, on the other hand, will rely on the complete forward algorithm 
rather than an approximation. This will ensure that we compute the correct 
observation likelihood. Furthermore, the A* decoding algorithm allows us 
to use any arbitrary language model. 

The A* decoding algorithm is a kind of best-first search of the lattice or 
tree which implicitly defines the sequence of allowable words in a language. 
Consider the tree in Figure 7.13, rooted in the START node on the left. Each 
leaf of this tree defines one sentence of the language; the one formed by 
concatenating all the words along the path from START to the leaf. We 
don’t represent this tree explicitly, but the stack decoding algorithm uses the 
tree implicitly as a way to structure the decoding search. 

The algorithm performs a search from the root of the tree toward the 
leaves, looking for the highest probability path, and hence the highest prob¬ 
ability sentence. As we proceed from root toward the leaves, each branch 
leaving a given word node represent a word which may follow the current 
word. Each of these branches has a probability, which expresses the condi¬ 
tional probability of this next word given the part of the sentence we’ve seen 
so far. In addition, we will use the forward algorithm to assign each word a 
likelihood of producing some part of the observed acoustic data. The A* de¬ 
coder must thus find the path (word sequence) from the root to a leaf which 
has the highest probability, where a path probability is defined as the prod¬ 
uct of its language model probability (prior) and its acoustic match to the 
data (likelihood). It does this by keeping a priority queue of partial paths 


PRIORITY 

QUEUE 
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(i.e. prefixes of sentences, each annotated with a score). In a priority queue 
each element has a score, and the pop operation returns the element with 
the highest score. The A* decoding algorithm iteratively chooses the best 
prefix-so-far, computes all the possible next words for that prefix, and adds 
these extended sentences to the queue. The Figure 7.14 shows the complete 
algorithm. 

Let’s consider a stylized example of a A* decoder working on a wave¬ 
form for which the correct transcription is If music be the food of love. Fig¬ 
ure 7.15 shows the search space after the decoder has examined paths of 
length one from the root. A fast match is used to select the likely next fast match 
words. A fast match is one of a class of heuristics designed to efficiently 
winnow down the number of possible following words, often by comput¬ 
ing some approximation to the forward probability (see below for further 
discussion of fast matching). 

At this point in our example, we’ve done the fast match, selected a sub¬ 
set of the possible next words, and assigned each of them a score. The word 
Alice has the highest score. We haven’t yet said exactly how the scoring 
works, although it will involve as a component the probability of the hypoth- 
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function Stack-Decoding() returns min-distance 

Initialize the priority queue with a null sentence. 

Pop the best (highest score) sentence s off the queue. 

If (.v is marked end-of-sentence (EOS)) output s and terminate. 

Get list of candidate next words by doing fast matches. 

For each candidate next word w: 

Create a new candidate sentence s + w. 

Use forward algorithm to compute acoustic likelihood L of s + w 
Compute language model probability P of extended sentence .v + vv 
Compute ‘score’ for s + w (a function of L, P, and ???) 
if (end-of-sentence) set EOS flag for s + w. 

Insert s + w into the queue together with its score and EOS flag 


Figure 7.14 The A* decoding algorithm (modified from Paul (1991) and 
Jelinek (1997)). The evaluation function that is used to compute the score for 
a sentence is not completely defined here; possibly evaluation functions are 
discussed below. 


esized sentence given the acoustic input P{W\A), which itself is composed 
of the language model probability P{W) and the acoustic likelihood P(A\W). 

Figure 7.16 show the next stage in the search. We have expanded the 
Alice node. This means that the Alice node is no longer on the queue, but its 
children are. Note that now the node labeled //actually has a higher score 
than any of the children of Alice. 

Figure 7.17 shows the state of the search after expanding the if node, 
removing it, and adding if music, if muscle, and if messy on to the queue. 

We’ve implied that the scoring criterion for a hypothesis is related to its 
probability. Indeed it might seem that the score for a string of words w\ given 
an acoustic string y\ should be the product of the prior and the likelihood: 

Alas, the score cannot be this probability because the probability will 
be much smaller for a longer path than a shorter one. This is due to a simple 
fact about probabilities and substrings; any prefix of a string must have a 
higher probability than the string itself (e.g. P(START the...) will be greater 
than P(START the book)). Thus if we used probability as the score, the A* 
decoding algorithm would get stuck on the single-word hypotheses. 

Instead, we use what is called the A* evaluation function (Nilsson, 
1980; Pearl, 1984) called /*(/?), given a partial path p: 
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P(acoustic I "if*) = 
forward probability 



Figure 7.15 The beginning of the search for the sentence If music be the 
food of love. At this early stage Alice is the most likely hypothesis (it has a 
higher score than the other hypotheses). 


P(acousticsl "if ) = 
forward probability 



Figure 7.16 The next step of the search for the sentence If music be the 
food of love. We’ve now expanded the Alice node, and added three extensions 
which have a relatively high score (yvas, wants , and walls). Note that now the 
node with the highest score is START if which is not along the START Alice 
path at all! 


f(p)=g(p)+h*(p) 
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/* ( p ) is the estimated score of the best complete path (complete sen¬ 
tence) which starts with the partial path p. In other words, it is an estimate of 
how well this path would do if we let it continue through the sentence. The 
A* algorithm builds this estimate from two components: 

• §(P ) is the score from the beginning of utterance to the end of the par¬ 
tial path p. This g function can be nicely estimated by the probability 
of p given the acoustics so far (i.e. as P{A\W)P{W) for the word string 
W constituting p). 

• h*(p) is an estimate of the best scoring extension of the partial path to 
the end of the utterance. 

Coming up with a good estimate of h* is an unsolved and interesting 
problem. One approach is to choose as h* an estimate which correlates with 
the number of words remaining in the sentence (Paul, 1991); see Jelinek 
(1997) for further discussion. 

We mentioned above that both the A* and various other two-stage de¬ 
coding algorithms require the use of a fast match for quickly finding which 
words in the lexicon are likely candidates for matching some portion of the 
acoustic input. Many fast match algorithms are based on the use of a tree- 
structured structured lexicon, which stores the pronunciations of all the words in such 
a way that the computation of the forward probability can be shared for 
words which start with the same sequence of phones. The tree-structured 
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lexicon was first suggested by Klovstad and Mondshein (1975); fast match 
algorithms which make use of it include Gupta et al (1988), Bahl et al 
(1992) in the context of A* decoding, and Ney et al (1992) and Nguyen and 
Schwartz (1999) in the context of Viterbi decoding. Figure 7.18 shows an 
example of a tree-structured lexicon from the Sphinx-II recognizer (Ravis- 
hankar, 1996). Each tree root represents the first phone of all words begin¬ 
ning with that context dependent phone (phone context may or may not be 
preserved across word boundaries), and each leaf is associated with a word. 



Figure 7.18 A tree-structured lexicon from the Sphinx-II recognizer (af¬ 
ter Ravishankar (1996)). Each node corresponds to a particular triphone in a 
slightly modified version of the ARPAbet; thus EY(B,KD) means the phone 
EY preceded by a B and followed by the closure of a K. 


There are many other kinds of multiple-stage search, such as the forward- 
backward search algorithm (not to be confused with the forward-backward backward 
algorithm) (Austin et al , 1991) which performs a simple forward search fol¬ 
lowed by a detailed backward (i.e. time-reversed) search. 
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7.5 Acoustic Processing of Speech 
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This section presents a very brief overview of the kind of acoustic processing 
commonly called feature extraction or signal analysis in the speech recog¬ 
nition literature. The term features refers to the vector of numbers which 
represent one time-slice of a speech signal. A number of kinds of features 
are commonly used, such as LPC features and PLP features. All of these are 
spectral features, which means that they represent the waveform in terms of 
the distribution of different frequencies which make up the waveform; such 
a distribution of frequencies is called a spectrum. We will begin with a brief 
introduction to the acoustic waveform and how it is digitized, summarize the 
idea of frequency analysis and spectra, and then sketch out different kinds of 
extracted features. This will be an extremely brief overview; the interested 
reader should refer to other books on the linguistics aspects of acoustic pho¬ 
netics (Johnson, 1997; Ladefoged, 1996) or on the engineering aspects of 
digital signal processing of speech (Rabiner and Juang, 1993). 


Sound Waves 

The input to a speech recognizer, like the input to the human ear, is a complex 
series of changes in air pressure. These changes in air pressure obviously 
originate with the speaker, and are caused by the specific way that air passes 
through the glottis and out the oral or nasal cavities. We represent sound 
waves by plotting the change in air pressure over time. One metaphor which 
sometimes helps in understanding these graphs is to imagine a vertical plate 
which is blocking the air pressure waves (perhaps in a microphone in front of 
a speaker’s mouth, or the eardrum in a hearer’s ear). The graph measures the 
amount of compression or rarefaction (uncompression) of the air molecules 
at this plate. Figure 7.19 shows the waveform taken from the Switchboard 
corpus of telephone speech of someone saying “she just had a baby”. 
frequency Two important characteristics of a wave are its frequency and ampli- 

amplitude tude. The frequency is the number of times a second that a wave repeats 

itself, or cycles. Note in Figure 7.19 that there are 28 repetitions of the wave 
in the .11 seconds we have captured. Thus the frequency of this segment of 
second per the wave is 28/. 11 or 255 cycles per second. Cycles per second are usually 

hertz called Hertz (shortened to Hz), so the frequency in Figure 7.19 would be 

described as 255 Hz. 

The vertical axis in Figure 7.19 measures the amount of air pressure 
amplitude variation. A high value on the vertical axis (a high amplitude) indicates 
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Figure 7.19 A waveform of the vowel [iy] from the utterance shown in 
Figure 7.20. The y-axis shows the changes in air pressure above and below 
normal atmospheric pressure. The x-axis shows time. Notice that the wave 
repeats regularly. 


that there is more air pressure at that point in time, a zero value means there 
is normal (atmospheric) air pressure, while a negative value means there is 
lower than normal air pressure (rarefaction). 

Two important perceptual properties are related to frequency and am¬ 
plitude. The pitch of a sound is the perceptual correlate of frequency; in pitch 
general if a sound has a higher-frequency we perceive it as having a higher 
pitch, although the relationship is not linear, since human hearing has differ¬ 
ent acuities for different frequencies. Similarly, the loudness of a sound is 
the perceptual correlate of the power, which is related to the square of the 
amplitude. So sounds with higher amplitudes arc perceived as louder, but 
again the relationship is not linear. 

How to Interpret a Waveform 

Since humans (and to some extent machines) can transcribe and understand 
speech just given the sound wave, the waveform must contain enough infor¬ 
mation to make the task possible. In most cases this information is hard to 
unlock just by looking at the waveform, but such visual inspection is still 
sufficient to learn some things. For example, the difference between vowels 
and most consonants is relatively clear on a waveform. Recall that vowels 
arc voiced, lend to be long, and arc relatively loud. Length in time manifests 
itself directly as length in space on a waveform plot. Loudness manifests 
itself as high amplitude. How do wc recognize voicing? Recall that voicing 
is caused by regular openings and closing of the vocal folds. When the vocal 
folds are vibrating, we can see regular peaks in ampl itude of the kind we saw 
in Figure 7.19. During a stop consonant, for example the closure of a [p], [t], 
or [k], we should expect no peaks at all; in fact we expect silence. 

Notice in Figure 7.20 the places where there are regular amplitude 
peaks indicating voicing; from second .46 to .58 (the vowel [iy]). from sec- 
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oncl .65 to .74 (the vowel [axj) and so on. The places where there is no 
amplitude indicate the silence of a stop closure; for example from second 
1 .06 to second 1.08 (the closure for the first [b], or from second 1.26 to 1.28 
(the closure for the second [b]). 



'll ^ n-10 IIJO Tiftis nvn 0*J in 110 1 20 i M 

-d J J d J d dj d d d 11 

Figure 7.20 A waveform of the sentence “She just had a baby'* from the 
Switchboard corpus (conversation 4325). The speaker is female, was 20 years 
old in 1991 which is approximately when the recording was made, and speaks 
the South Midlands dialect of American English. The phone labels show 
where each phone ends. 


Fricatives like [sh] can also be recognized in a waveform; they produce 
an intense irregular pattern; the [sh] from second .33 to .46 is a good example 
of a fricative. 


Spectra 

While some broad phonetic features (presence of voicing, stop closures, 
fricatives) can be interpreted from a waveform, more detailed classification 
(which vowel? which fricative?) requires a different representation of the 
spectral input in terms of spectral features. Spectral features are based on the in¬ 
sight of Fourier that every complex wave can be represented as a sum of 
many simple waves of different frequencies. A musical analogy for this is 
the chord; just as a chord is composed of multiple notes, any waveform is 
composed of the waves corresponding to its individual “notes’ 1 . 

Consider Figure 7.2 h which shows part of the waveform for the vowel 
[as] of the word had at second 0.9 of the sentence. Note that there is a com¬ 
plex wave which repeats about nine times in the figure; but there is also a 
smaller repeated wave which repeats four times for every larger pattern (no¬ 
tice the four small peaks inside each repeated wave). The complex wave has 
a frequency of about 250 Hz (we can figure this out since it repeats roughly 
9 times in .036 seconds, and 9 cyclcs/.036 seconds = 250 Hz). The smaller 
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wave then should have a frequency of roughly 4 Limes the frequency of the 
larger wave, or roughly 1000 Hz* Then if you look carefully you can see 
two little waves on Lhc peak of many of the 1000 Hz waves. The frequency 
of this tiniest wave must be roughly twice that of the 1000 Hz wave, hence 
2000 Hz. 

A spectrum is a representation of these different frequency eompo- spectrum 
nenis of a wave. It can be computed by a Fourier transform, a mathematical transform 
procedure which separates out each of the frequency components of a wave. 

Rather than using the Fourier transform spectrum directly, most speech ap~ 
plications use a smoothed version of the spectrum called the LPC spectrum lpc 
(A tal and Hanaucr, 1971; Itakura, 1975). 

Figure 7.22 shows an LPC spectrum for the waveform in Figure 7,21. 

LPC (Linear Predictive Coding) is a way of coding the spectrum which 
makes it easier to see where the spectral peaks are. peaks ral 



Figure 7.22 An LPC spectrum for the vowel [as] waveform of She just had 
a baby at the point in time shown in Figure 7*21 ♦ LPC makes it easy to see 

formants. 
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The x-axis of a spectrum shows frequency while the y-axis shows some 
measure of the magnitude of each frequency component (in decibels (dB), 
a logarithmic measure of amplitude). Thus Figure 7.22 shows that there are 
important frequency components at 930 Hz, 1860 Hz, and 3020 Hz, along 
with many other lower-magnitude frequency components. These important 
components at roughly 1000 Hz and 2000 Hz are just what we predicted by 
looking at the wave in Figure 7.21! 

Why is a spectrum useful? It turns out that these spectral peaks that 
are easily visible in a spectrum are very characteristic of different sounds; 
phones have characteristic spectral ‘signatures’. For example different chem¬ 
ical elements give off different wavelengths of light when they bum, allow¬ 
ing us to detect elements in stars light-years away by looking at the spectrum 
of the light. Similarly, by looking at the spectrum of a waveform, we can de¬ 
tect the characteristic signature of the different phones that are present. This 
use of spectral information is essential to both human and machine speech 
recognition. In human audition, the function of the cochlea or inner ear is 
to compute a spectmm of the incoming waveform. Similarly, the features 
used as input to the HMMs in speech recognition are all representations of 
spectra, usually variants of LPC spectra, as we will see. 

While a spectmm shows the frequency components of a wave at one 
point in time, a spectrogram is a way of envisioning how the different fre¬ 
quencies which make up a waveform change over time. The x-axis shows 
time, as it did for the waveform, but the y-axis now shows frequencies in Hz. 
The darkness of a point on a spectrogram corresponding to the amplitude of 
the frequency component. For example, look in Figure 7.23 around second 
0.9, and notice the dark bar at around 1000 Hz. This means that the [iy] 
of the word she has an important component around 1000 Hz (1000 Hz is 
just between the notes B and C). The dark horizontal bars on a spectrogram, 
representing spectral peaks, usually of vowels, are called formants. 

What specific clues can spectral representations give for phone identi¬ 
fication? First, different vowels have their formants at characteristic places. 
We’ve seen that [ae] in the sample waveform had formants at 930 Hz, 1860 
Hz, and 3020 Hz. Consider the vowel [iy], at the beginning of the utterance 
in Figure 7.20. The spectmm for this vowel is shown in Figure 7.24. The first 
formant of [iy] is 540 Hz; much lower than the first formant for [ae], while the 
second formant (2581 Hz) is much higher than the second formant for [ae]. 
If you look carefully you can see these formants as dark bars in Figure 7.23 
just around 0.5 seconds. 

The location of the first two formants (called FI and F2) plays a large 
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Figure 7*23 A spectrogram of Lhe sentence “She just had a baby” whose 
waveform was shown in Figure 7,20. One way to think of a spectrogram is as 
a collection of spectra (time-sliced) like Figure 7,22 placed end lo end. 



Figure 7*24 A smoothed (LPC) spectrum for the vowel [ty] at the start of 
She just had a bahy. Note thaL the first formanl (540 Hz) is much tower than 
the first formant for [ae] shown in Figure 7*22, while the second formant (25SI 
Hz) is much higher than the second formant for [ae]. 


role in determining vowel identity, although the formants still differ from 
speaker to speaker Formants also can be used to identify the nasal phones 
[n], [m], and [g], the lateral phone [1], and [r]. Why do different vowels have 
different spectral signatures? The formants are caused by the resonant cav¬ 
ities of the mouth. The oral cavity can be thought of as a filter which se¬ 
lectively passes through some of the harmonics of the vocal cord vibrations. 
Moving the tongue creates spaces of different size inside the mouth which 
selectively amplify waves of the appropriate wavelength, hence amplifying 
different frequency bands. 
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Feature Extraction 

Our survey of the features of waveforms and spectra was necessarily brief, 
but the reader should have the basic idea of the importance of spectral fea¬ 
tures and their relation to the original waveform. Let’s now summarize the 
process of extraction of spectral features, beginning with the sound wave 
itself and ending with a feature vector. 4 An input soundwave is first dig¬ 
itized. This process of analog-to-digital conversion has two steps: sam¬ 
pling and quantization. A signal is sampled by measuring its amplitude 
at a particular time; the sampling rate is the number of samples taken per 
second. Common sampling rates are 8,000 Hz and 16,000 Hz. In order to 
accurately measure a wave, it is necessary to have at least two samples in 
each cycle: one measuring the positive part of the wave and one measuring 
the negative part. More than two samples per cycle increases the amplitude 
accuracy, but less than two samples will cause the frequency of the wave to 
be completely missed. Thus the maximum frequency wave that can be mea¬ 
sured is one whose frequency is half the sample rate (since every cycle needs 
2 samples). This maximum frequency for a given sampling rate is called the 
Nyquist frequency. Most information in human speech is in frequencies be¬ 
low 10,000 Hz; thus a 20,000 Hz sampling rate would be necessary for com¬ 
plete accuracy. But telephone speech is filtered by the switching network, 
and only frequencies less than 4,000 Hz are transmitted by telephones. Thus 
an 8,000 Hz sampling rate is sufficient for telephone-bandwidth speech like 
the Switchboard corpus. 

Even an 8,000 Hz sampling rate requires 8000 amplitude measure¬ 
ments for each second of speech, and so it is important to store the amplitude 
measurement efficiently. They are usually stored as integers, either 8-bit 
(values from -128 - 127) or 16 bit (values from -32768 - 32767). This pro¬ 
cess of representing a real-valued number as a integer is called quantization 
because there is a minimum granularity (the quantum size) and all values 
which are closer together than this quantum size are represented identically. 

Once a waveform has been digitized, it is converted to some set of 
spectral features. An LPC spectrum is represented by a vector of features; 
each formant is represented by two features, plus two additional features to 
represent spectral tilt. Thus 5 formants can be represented by 12 (5x2+2) 
features. It is possible to use LPC features directly as the observation sym- 

4 The reader might want to bear in mind Picone’s (1993) reminder that the use of the word 
extraction should not be thought of as encouraging the metaphor of features as something 
‘in the signal’ waiting to be extracted. 
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bols of an HMM. However, further processing is often done to the features. 

CEPSTRj 

One popular feature set is cepstral coefficients, which are computed from coefr 
the LPC coefficients by taking the Fourier transform of the spectrum. An¬ 
other feature set, PLP (Perceptual Linear Predictive analysis (Hermansky, plp 
1990)), takes the LPC features and modifies them in ways consistent with 
human hearing. For example, the spectral resolution of human hearing is 
worse at high frequencies, and the perceived loudness of a sound is related 
to the cube rate of its intensity. So PLP applies various filters to the LPC 
spectrum and takes the cube root of the features. 


7.6 Computing Acoustic Probabilities 


The last section showed how the speech input can be passed through signal 
processing transformations and turned into a series of vectors of features, 
each vector representing one time-slice of the input signal. How are these 
feature vectors turned into probabilities? 

One way to compute probabilities on feature vectors is to first cluster 
them into discrete symbols that we can count; we can then compute the 
probability of a given cluster just by counting the number of times it occurs in 
some training set. This method is usually called vector quantization. Vector 
quantization was quite common in early speech recognition algorithms but 
has mainly been replaced by a more direct but compute-intensive approach: 
computing observation probabilities on a real-valued (‘continuous’) input 
vector. This method thus computes a probability density function or pdf 
over a continuous space. 

There are two popular versions of the continuous approach. The most 
widespread of the two is the use of Gaussian pdfs, in the simplest ver¬ 
sion of which each state has a single Gaussian function which maps the 
observation vector o t to a probability. An alternative approach is the use 
of neural networks or multi-layer perceptrons which can also be trained 
to assign a probability to a real-valued feature vector. HMMs with Gaus¬ 
sian observation-probability-estimators are trained by a simple extension to 
the forward-backward algorithm (discussed in Appendix D). HMMs with 
neural-net observation-probability-estimators are trained by a completely 
different algorithm known as error back-propagation. 

In the simplest use of Gaussians, we assume that the possible values 
of the observation feature vector o t are normally distributed, and so we rep¬ 
resent the observation probability function bj(o t ) as a Gaussian curve with 
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mean vector jjj and covariance matrix (prime denotes vector transpose). 
We present the equation here for completeness, although we will not cover 
the details of the mathematics: 




(7.10) 


Usually we make the simplifying assumption that the covariance ma¬ 
trix Ly is diagonal, i.e. that it contains the simple variance of cepstral feature 
1, the simple variance of cepstral feature 2, and so on, without worrying 
about the effect of cepstral feature 1 on the variance of cepstral feature 2. 
This means that in practice we are keeping only a single separate mean and 
variance for each feature in the feature vector. 

Most recognizers do something even more complicated; they keep 
multiple Gaussians for each state, so that the probability of each feature of 
the observation vector is computed by adding together a variety of Gaussian 
curves. This technique is called Gaussian mixtures. In addition, many ASR 
systems share Gaussians between states in a technique known as parameter 
tying (or tied mixtures) (Huang and Jack, 1989). For example acoustically 
similar phone states might share (i.e. use the same) Gaussians for some fea¬ 
tures. 

How are the mean and covariance of the Gaussians estimated? It is 
helpful again to consider the simpler case of a non-hidden Markov Model, 
with only one state L The vector of feature means ja and the vector of covari¬ 
ances L could then be estimated by averaging: 


1 T 

Pi = (7.H) 

£■ = ^X> f -AO)'(0 f -fij)\ (7.12) 

But since there are multiple hidden states, we don’t know which obser¬ 
vation vector o t was produced by which state. Appendix D will show how 
the forward-backward algorithm can be modified to assign each observation 
vector o t to every possible state /, prorated by the probability that the HMM 
was in state i at time t. 

An alternative way to model continuous-valued features is the use of a 
network neural network, multilayer perceptron (MLP) or Artificial Neural Net- 

perceptron works (ANNs). Neural networks are far too complex for us to introduce in 
mlp a page or two here; thus we will just give the intuition of how they are used 
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in probability estimation as an alternative to Gaussian estimators. The inter¬ 
ested reader should consult basic neural network textbooks (Anderson, 1995; 
Hertz et al , 1991) as well as references specifically focusing on neural- 
network speech recognition (Bourlard and Morgan, 1994). 

A neural network is a set of small computation units connected by 
weighted links. The network is given a vector of input values and computes 
a vector of output values. The computation proceeds by each computational 
unit computing some non-linear function of its input units and passing the 
resulting value on to its output units. 

The use of neural networks we will describe here is often called a hy- 
brid HMM-MLP approach, since it uses some elements of the HMM (such 
as the state-graph representation of the pronunciation of a word) but the 
observation-probability computation is done by an MLP instead of a mix¬ 
ture of Gaussians. The input to these MLPs is a representation of the signal 
at a time t and some surrounding window; for example this might mean a 
vector of spectral features for a time t and 8 additional vectors for times 
t + 10ms, l + 20ms, t + 30ms, l + 40ms, t — 10ms, etc. Thus the input to 
the network is a set of nine vectors, each vector having the complete set of 
real-valued spectral features for one time slice. The network has one output 
unit for each phone; by constraining the values of all the output units to sum 
to 1, the net can be used to compute the probability of a state j given an 
observation vector o t , or P{j\o t ). Figure 7.25 shows a sample of such a net. 

This MLP computes the probability of the HMM state j given an ob¬ 
servation o t , or P(qj\o t ). But the observation likelihood we need for the 
HMM, bj(o t ), is P(o t \qj). The Bayes rule can help us see how to compute 
one from the other. The net is computing: 

PtoyW = »<S) (7 ,3) 

We can rearrange the terms as follows: 


PipMj) _ P{qj\o t ) . . 

P(o<) P(qj) ' 

The two terms on the right-hand side of (7.14) can be directly com¬ 
puted from the MLP; the numerator is the output of the MLP, and the denom¬ 
inator is the total probability of a given state, summing over all observations 
(i.e. the sum over all t of a 7 (/)). Thus although we cannot directly compute 

P{o t \qj), we can use (7.14) to compute which is known as a scaled 

likelihood (the likelihood divided by the probability of the observation) . 
In fact, the scaled likelihood is just as good as the regular likelihood, since 
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Figure 7.25 A neural net used to estimate phone state probabilities. Such 
a net can be used in an HMM model as an alternative to the Gaussian models. 
This particular net is from the MLP systems described in Bourlard and Morgan 
(1994); it is given a vector of features for a frame and for the four frames 
on either side, and estimates p(qj\o t ). This probability is then converted to 
an estimate of the observation likelihood b — p(o t \qj) using the Bayes rule. 
These nets are trained using the error-back-propagation algorithm as part of 
the same embedded training algorithm that is used for Gaussians. 


the probability of the observation p(o t ) is a constant during recognition and 
doesn’t hurt us to have in the equation. 

The error-back-propagation algorithm for training an MLP requires 
that we know the correct phone label qj for each observation o t . Given a 
large training set of observations and correct labels, the algorithm iteratively 
adjusts the weights in the MLP to minimize the error with this training set. 
In the next section we will see where this labeled training set comes from, 
and how this training fits in with the embedded training algorithm used 
for HMMs. Neural nets seem to achieve roughly the same performance as 
a Gaussian model but have the advantage of using less parameters and the 
disadvantage of taking somewhat longer to train. 
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Methodology Box: Word Error Rate 


The standard evaluation metric for speech recognition systems 
is the word error rate. The word error rate is based on how much 
the word string returned by the recognizer (often called the hypoth¬ 
esized word string) differs front a correct or reference transcription. 
Given such a correct transcription, the first step in computing word 
error is to compute the minimum edit distance in words between 
the hypothesized and correct strings. The result of this computation 
will be the minimum number of word substitutions, word inser¬ 
tions, and word deletions necessary to map between the correct and 
hypothesized strings. The word error rate is then defined as follows 
(note that because the equation includes insertions, the error rate can 
be great than 100%): 


Word Error Rate 


100 


Insertions + Substitutions + Deletions 


Total Words in Correct Transcript 
Here is an example of alignments between a reference and a 
hypothesized utterance from the CALL HOME corpus, showing the 
counts used to compute the word error rate: 


REF: 

j 

44 

UM 

the PHONE TS i 

LEFT THE portable 

IIYP: 

i GOT 

IT 

TO 

the ***** FULLEST i 

I OVF TO po rtabi e 

Eval; 

I 

r 

S 

D S 

S S 

REF: 


PHONE 

UPSTAIRS last night so 

the battery ran out 

HYP; 

FORM 

OF 


STORES last night so 

the battery ran out 

Eval: 

I 

S 


S 



This utterance has 6 substitutions, 3 insertions, and 1 deletion: 

Word Error Rate = 100 6 + 3 + 1 =56% 

18 

As of the time of this writing, state-of-the-art speech recognition 
systems were achieving around 20% word error rate on natural- 
speech tasks tike the National Institute of Standards and Technology 
(NIST)’s Hub4 test set from the Broadcast News corpus (Chen et al„ 
1999), and around 40% word error rate on NIST's Hub5 test set from 
the combined Switchboard, Switchboard-11, and CALLMOME cor¬ 
pora (Hain et al ., 1999). 
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7.7 Training a Speech Recognizer 

We have now introduced all the algorithms which make up the standard 
speech recognition system that was sketched in Figure 7.2 on page 239. 
We’ve seen how to build a Yiterbi decoder, and how it takes 3 inputs (the 
observation likelihoods (via Gaussian or MLP estimation from the spectral 
features), the HMM lexicon, and the A-gram language model) and produces 
the most probable string of words. But we have not seen how all the proba¬ 
bilistic models that make up a recognizer get trained. 
training 0 In this section we give a brief sketch of the embedded training proce¬ 

dure that is used by most ASR systems, whether based on Gaussians, MLPs, 
or even vector quantization. Some of the details of the algorithm (like the 
forward-backward algorithm for training HMM probabilities) have been re¬ 
moved to Appendix D. 

Let’s begin by summarizing the four probabilistic models we need to 
train in a basic speech recognition system: 

• language model probabilities: P(w/|w/_iW/_ 2 ) 

• observation likelihoods: bj(o t ) 

• transition probabilities: 

• pronunciation lexicon: HMM state graph structure 
In order to train these components we usually have 

• a training corpus of speech wavefiles, together with a word-transcription. 

• a much larger corpus of text for training the language model, includ¬ 
ing the word-transcriptions from the speech corpus together with many 
other similar texts. 

• often a smaller training corpus of speech which is phonetically labeled 
(i.e. frames of the acoustic signal are hand-annotated with phonemes). 

Let’s begin with the A-gram language model. This is trained in the 
way we described in Chapter 6; by counting A-gram occurrences in a large 
corpus, then smoothing and normalizing the counts. The corpus used for 
training the language model is usually much larger than the corpus used to 
train the HMM a and b parameters. This is because the larger the training 
corpus the more accurate the models. Since A-gram models are much faster 
to train than HMM observation probabilities, and since text just takes less 
space than speech, it turns out to be feasible to train language models on 
huge corpora of as much as half a billion words of text. Generally the corpus 
used for training the HMM parameters is included as part of the language 
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model training data; it is important that the acoustic and language model 
training be consistent. 

The HMM lexicon structure is built by hand, by taking an off-the-shelf 
pronunciation dictionary such as the PRONLEX dictionary (LDC, 1995) or 
the CMUdict dictionary, both described in Chapter 4. In some systems, each 
phone in the dictionary maps into a state in the HMM. So the word cat would 
have 3 states corresponding to [k], [ae], and [t]. Many systems, however, use 
the more complex subphone structure described on page 249, in which each 
phone is divided into 3 states: the beginning, middle and final portions of 
the phone, and in which furthermore there are separate instances of each of 
these subphones for each triphone context. 

The details of the embedded training of the HMM parameters varies; 
we’ll present a simplified version. First, we need some initial estimate of 
the transition and observation probabilities and bj(o t ). For the transi¬ 
tion probabilities, we start by assuming that for any state all the possible 
following states are all equiprobable. The observation probabilities can be 
bootstrapped from a small hand-labeled training corpus. For example, the 
TIMIT or Switchboard corpora contain approximately 4 hours each of pho¬ 
netically labeled speech. They supply a ‘correct’ phone state label q for each 
frame of speech. These can be fed to an MLP or averaged to give initial 
Gaussian means and variances. For MLPs this initial estimate is important, 
and so a hand-labeled bootstrap is the norm. For Gaussian models the initial 
value of the parameters seems to be less important and so the initial mean 
and variances for Gaussians often are just set identically for all states by 
using the mean and variances of the entire training set. 

Now we have initial estimates for the a and b probabilities. The next 
stage of the algorithm differs for Gaussian and MLP systems. For MLP sys¬ 
tems we apply what is called a forced Viterbi alignment. A forced Yiterbi viterb? 
alignment takes as input the correct words in an utterance, along with the 
spectral feature vectors. It produces the best sequence of HMM states, with 
each state aligned with the feature vectors. A forced Yiterbi is thus a simpli¬ 
fication of the regular Yiterbi decoding algorithm, since it only has to figure 
out the correct phone sequence, but doesn’t have to discover the word se¬ 
quence. It is called forced because we constrain the algorithm by requiring 
the best path to go through a particular sequence of words. It still requires 
the Viterbi algorithm since words have multiple pronunciations, and since 
the duration of each phone is not fixed. The result of the forced Viterbi is a 
set of features vectors with ‘correct’ phone labels, which can then be used to 
retrain the neural network. The counts of the transitions which are taken in 
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the forced alignments can be used to estimate the HMM transition probabil¬ 
ities. 

For the Gaussian HMMs, instead of using forced Yiterbi, we use the 
forward-backward algorithm described in Appendix D. We compute the for¬ 
ward and backward probabilities for each sentence given the initial a and 
b probabilities, and use them to re-estimate the a and b probabilities. Just 
as for the MLP situation, the forward-backward algorithm needs to be con¬ 
strained by our knowledge of the correct words. The forward-backward al¬ 
gorithm computes its probabilities given a model X. We use the ‘known’ 
words sequence in a transcribed sentence to tell us which word models to 
string together to get the model X that we use to compute the forward and 
backward probabilities for each sentence. 


7.8 Waveform Generation for Speech Synthesis 

Now that we have covered acoustic processing we can return to the acoustic 
component of a text-to-speech (TTS) system. Recall from Chapter 4 that the 
output of the linguistic processing component of a TTS system is a sequence 
of phones, each with a duration, and a FO contour which specifies the pitch. 
target This specification is often called the target, as it is this that we want the 
synthesizer to produce. 

The most commonly used type of algorithm works by waveform con- 
concatena- catenation. Such concatenative synthesis is based on a database of speech 
that has been recorded by a single speaker. This database is then segmented 
into a number of short units, which can be phones, diphones, syllables, words 
or other units. The simplest sort of synthesizer would have phone units and 
the database would have a single unit for each phone in the phone inventory. 
By selecting units appropriately, we can generate a series of units which 
match the phone sequence in the input. By using signal processing to smooth 
joins at the unit edges, we can simply concatenate the waveforms for each of 
these units to form a single synthetic speech waveform. 

Experience has shown that single phone concatenative systems don’t 
produce good quality speech. Just as in speech recognition, the context of 
the phone plays an important role in its acoustic pattern and hence a /t I before 
a /a/ sounds very different from a Itl before an /s/. 

The triphone models described in Figure 7.11 on page 249 are a pop¬ 
ular choice of unit in speech recognition, because they cover both the left 
and right contexts of a phone. Unfortunately, a language typically has a 
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very large number of triphones (tens of thousands) and it is currently pro¬ 
hibitive to collect so many units for speech synthesis. Hence diphones are 
often used in speech synthesis as they provide a reasonable balance between 
context-dependency and size (typically 1000-2000 in a language). In speech 
synthesis, diphone units normally start half-way through the first phone and 
end half-way through the second. This is because it is known that phones are 
more stable in the middle than at the edges, so that the middles of most /a/ 
phones in a diphone are reasonably similar, even if the acoustic patterns start 
to differ substantially after that. If diphones are concatenated in the middles 
of phones, the discontinuities between adjacent units are often negligible. 

Pitch and Duration Modification 

The diphone synthesizer as just described will produce a reasonable qual¬ 
ity speech waveform corresponding to the requested phone sequence. But 
the pitch and duration (i.e. the prosody) of each phone in the concatenated 
waveform will be the same as when the diphones were recorded and will not 
correspond to the pitch and durations requested in the input. The next stage 
of the synthesis process therefore is to use signal processing techniques to 
change the prosody of the concatenated waveform. 

The linear prediction (LPC) model described earlier can be used for 
prosody modification as it explicitly separates the pitch of a signal from its 
spectral envelope If the concatenated waveform is represented by a sequence 
of linear prediction coefficients, a set of pulses can be generated correspond¬ 
ing to the desired pitch and used to re-excite the coefficients to produce a 
speech waveform again. By contracting and expanding frames of coeffi¬ 
cients, the duration can be changed. While linear prediction produces the 
correct F0 and durations it produces a somewhat “buzzy” speech signal. 

Another technique for achieving the same goal is the time-domain 
pitch-synchronous overlap and add (TD-PSOLA) technique. TD-PSOLA 
works pitch-synchronously in that each frame is centered around a pitch- 
mark in the speech, rather than at regular intervals as in normal speech sig¬ 
nal processing. The concatenated waveform is split into a number of frames, 
each centered around a pitchmark and extending a pitch period either side. 
Prosody is changed by recombining these frames at a new set of pitchmarks 
determined by the requested pitch and duration of the input. The synthetic 
waveform is created by simply overlapping and adding the frames. Pitch is 
increased by making the new pitchmarks closer together (shorter pitch peri¬ 
ods implies higher frequency pitch), and decreased by making them further 


DIPHONES 


TD-PSOLA 
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apart. Speech is made longer by duplication frames and shorter by leav¬ 
ing frames out. The operation of TD-PSOLA can be compared to that of a 
tape recorder with variable speed - if you play back a tape faster than it was 
recorded, the pitch periods will come closer together and hence the pitch 
will increase. But speeding up a tape recording effectively increases the fre¬ 
quency of all the components of the speech (including the formants which 
characterize the vowels) and will give the impression of a “squeaky”, unnat¬ 
ural voice. TD-PSOLA differs because it separates each frame first and then 
decreases the distance between the frames. Because the internals of each 
frame aren’t changed, the frequency of the non-pitch components is hardly 
altered, and the resultant speech sounds the same as the original except with 
a different pitch. 

Unit Selection 

While signal processing and diphone concatenation can produce reasonable 
quality speech, the result is not ideal. There are a number of reasons for this, 
but they all boil down to the fact that having a single example of each diphone 
is not enough. First of all, signal processing inevitably incurs distortion, 
and the quality of the speech gets worse when the signal processing has to 
stretch the pitch and duration by large amounts. Furthermore, there are many 
other subtle effects which are outside the scope of most signal processing 
algorithms. For instance, the amount of vocal effort decreases over time as 
the utterance is spoken, producing weaker speech at the end of the utterance. 
If diphones are taken from near the start of an utterance, they will sound 
unnatural in phrase-final positions. 

Unit-selection synthesis is an attempt to address this problem by col¬ 
lecting several examples of each unit at different pitches and durations and 
linguistic situations, so that the unit is close to the target in the first place 
and hence the signal processing needs to do less work. One technique for 
unit-selection (Hunt and Black, 1996) works as follows: 

The input to the algorithm is the same as other concatenative synthe¬ 
sizers, with the addition that the FO contour is now specified as three FO 
values per phone, rather than as a contour. The technique uses phones as 
its units, indexing phones in a large database of naturally occurring speech 
Each phone in the database is also marked with a duration and three pitch 
values. The algorithm works in two stages. First, for each phone in the target 
word, a set of candidate units which match closely in terms of phone identity, 
duration and FO is selected from the database. These candidates are ranked 
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using a target cost function, which specifies just how close each unit actu¬ 
ally is to the target. The second part of the algorithm works by measuring 
how well each candidate for each unit joins with its neighbor’s candidates. 
Various locations for the joins are assessed, which allows the potential for 
units to be joined in the middle, as with diphones. These potential joins are 
ranked using a concatenation cost function. The final step is to pick the best 
set of units which minimize the overall target and concatenation cost for the 
whole sentence. This step is performed using the Viterbi algorithm in a sim¬ 
ilar way to HMM speech recognition: here the target cost is the observation 
probability and the concatenation cost is the transition probability. 

By using a much larger database which contains many examples of 
each unit, unit-selection synthesis often produces more natural speech than 
straight diphone synthesis. Some systems then use signal processing to make 
sure the prosody matches the target, while others simply concatenate the 
units following the idea that a utterance which only roughly matches the 
target is better than one that exactly matches it but also has some signal 
processing distortion. 

7.9 Human Speech Recognition 

Speech recognition in humans shares some features with the automatic speech 
recognition models we have presented. We mentioned above that signal pro¬ 
cessing algorithms like PLP analysis (Hermansky, 1990) were in fact in¬ 
spired by properties of the human auditory system. In addition, four proper¬ 
ties of human lexical access (the process of retrieving a word from the men- Access 
tal lexicon) are also true of ASR models: frequency, parallelism, neigh¬ 
borhood effects, and cue-based processing. For example, as in ASR with 
its A-gram language models, human lexical access is sensitive to word fre¬ 
quency, High-frequency spoken words are accessed faster or with less in¬ 
formation than low-frequency words. They are successfully recognized in 
noisier environments than low frequency words, or when only parts of the 
words are presented (Howes, 1957; Grosjean, 1980; Tyler, 1984, inter alia). 

Like ASR models, human lexical access is parallel: multiple words are ac¬ 
tive at the same time (Marslen-Wilson and Welsh, 1978; Salasoo and Pisoni, 

1985, inter alia). Human lexical access exhibits neighborhood effects (the 
neighborhood of a word is the set of words which closely resemble it). 

Words with large frequency-weighted neighborhoods are accessed slower 
than words with less neighbors (Luce et al ., 1990). Jurafsky (1996) shows 
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that the effect of neighborhood on access can be explained by the Bayesian 
models used in ASR. 

Finally, human speech perception is cue-based: speech input is inter¬ 
preted by integrating cues at many different levels. For example, there is 
evidence that human perception of individual phones is based on the inte¬ 
gration of multiple cues, including acoustic cues, such as formant structure 
or the exact timing of voicing, (Oden and Massaro, 1978; Miller, 1994), vi¬ 
sual cues, such as lip movement (Massaro and Cohen, 1983; Massaro, 1998), 
and lexical cues such as the identity of the word in which the phone is placed 
(Warren, 1970; Samuel, 1981; Connine and Clifton, 1987; Connine, 1990). 
For example, in what is often called the phoneme restoration effect, Warren 
(1970) took a speech sample and replaced one phone (e.g. the [s] in legisla¬ 
ture) with a cough. Warren found that subjects listening to the resulting tape 
typically heard the entire word legislature including the [s], and perceived 
the cough as background. Other cues in human speech perception include 
association semantic word association (words are accessed more quickly if a semanti- 
™ 0N cally related word has been heard recently) and repetition priming (words 
are accessed more quickly if they themselves have just been heard). The 
intuitions of both of these results are incorporated into recent language mod¬ 
els discussed in Chapter 6, such as the cache model of Kuhn and de Mori 
(1990), which models repetition priming, or the trigger model of Rosenfeld 
(1996) and the LSA models of Coccaro and Jurafsky (1998) and Bellegarda 
(1999), which model word association. In a fascinating reminder that good 
ideas are never discovered only once, Cole and Rudnicky (1983) point out 
that many of these insights about context effects on word and phone pro¬ 
cessing were actually discovered by William Bagley (Bagley, 1901). Bagley 
achieved his results, including an early version of the phoneme restoration 
effect, by recording speech on Edison phonograph cylinders, modifying it, 
and presenting it to subjects. Bagley’s results were forgotten and only redis¬ 
covered much later. 

One difference between current ASR models and human speech recog¬ 
nition is the time-course of the model. It is important for the performance of 
the ASR algorithm that the the decoding search optimizes over the entire ut¬ 
terance. This means that the best sentence hypothesis returned by a decoder 
at the end of the sentence may be very different than the current-best hy¬ 
pothesis, half way into the sentence. By contrast, there is extensive evidence 
on-line that human processing is on-line: people incrementally segment and utter¬ 
ance into words and assign it an interpretation as they hear it. For example, 
Marslen-Wilson (1973) studied close shadowers: people who are able to 
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shadow (repeat back) a passage as they hear it with lags as short as 250 ms. 
Marslen-Wilson found that when these shadowers made errors, they were 
syntactically and semantically appropriate with the context, indicating that 
word segmentation, parsing, and interpretation took place within these 250 
ms. Cole (1973) and Cole and Jakimik (1980) found similar effects in their 
work on the detection of mispronunciations. These results have led psy¬ 
chological models of human speech perception (such as the Cohort model 
(Marslen-Wilson and Welsh, 1978) and the computational TRACE model 
(McClelland and Elman, 1986)) to focus on the time-course of word selec¬ 
tion and segmentation. The TRACE model, for example, is a connectionist tionist" 
or neural network interactive-activation model, based on independent com- network 
putational units organized into three levels: feature, phoneme, and word. 

Each unit represents a hypothesis about its presence in the input. Units are 
activated in parallel by the input, and activation flows between units; con¬ 
nections between units on different levels are excitatory, while connections 
between units on single level are inhibitatory. Thus the activation of a word 
slightly inhibits all other words. 

We have focused on the similarities between human and machine speech 
recognition; there are also many differences. In particular, many other cues 
have been been shown to play a role in human speech recognition but have 
yet to be successfully integrated into ASR. The most important class of these 
missing cues is prosody. To give only one example, Cutler and Norris (1988), 

Cutler and Carter (1987) note that most multisyllabic English word tokens 
have stress on the initial syllable, suggesting in their metrical segmentation 
strategy (MSS) that stress should be used as a cue for word segmentation. 


7.10 Summary 

Together with chapters 4, 5, and 6, this chapter introduced the fundamental 
algorithms for addressing the problem of Large Vocabulary Continuous 
Speech Recognition and Text-To-Speech synthesis. 

• The input to a speech recognizer is a series of acoustic waves. The 
waveform, spectrogram and spectrum are among the visualization 
tools used to understand the information in the signal. 

• In the first step in speech recognition, wound waves are sampled, 
quantized, and converted to some sort of spectral representation; A 
commonly used spectral representation is the LPC cepstrum, which 
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provides a vector of features for each time-slice of the input. 

• These feature vectors are used to estimate the phonetic likelihoods 
(also called observation likelihoods) either by a mixture of Gaussian 
estimators or by a neural net. 

• Decoding or search is the process of finding the optimal sequence of 
model states which matches a sequence of input observations. (The 
fact that are two terms for this process is a hint that speech recogni¬ 
tion is inherently inter-disciplinary, and draws its metaphors from more 
than one field; decoding comes from information theory, and search 
from artificial intelligence). 

• We introduced two decoding algorithms: time-synchronous Viterbi 
decoding (which is usually implemented with pruning and can then be 
called beam search) and stack or A* decoding. Both algorithms take 
as input a series of feature vectors, and 2 ancillary algorithms: one for 
assigning likelihoods (e.g. Gaussians or MLP) and one for assigning 
priors (e.g. an A-gram language model). Both give as output a string 
of words. 

• The embedded training paradigm is the normal method for training 
speech recognizers. Given an initial lexicon with hand-built pronunci¬ 
ation structures, it will train the HMM transition probabilities and the 
HMM observation probabilities. This HMM observation probability 
estimation can be done via a Gaussian or an MLR 

• One way to implement the acoustic component of a TTS system is with 
concatenative synthesis, in which an utterance is built by concatenat¬ 
ing and then smoothing diphones taken from a large database of speech 
recorded by a single speaker. 


Bibliographical and Historical Notes 

The first machine which recognized speech was probably a commercial toy 
named “Radio Rex” which was sold in the 1920’s. Rex was a celluloid dog 
which moved (via a spring) when the spring was released by 500 Hz acoustic 
energy. Since 500 Hz is roughly the first formant of the vowel in “Rex”, the 
dog seemed to come when he was called (David and Selfridge, 1962). 

By the late 1940’s and early 1950’s, a number of machine speech 
recognition systems had been built. An early Bell Labs system could rec¬ 
ognize any of the 10 digits from a single speaker (Davis et al , 1952). This 
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system had 10 speaker-dependent stored patterns, one for each digit, each of 
which roughly represented the first two vowel formants in the digit. They 
achieved 97-99% accuracy by choosing the pattern which had the highest 
relative correlation coefficient with the input. Fry (1959) and Denes (1959) 
built a phoneme recognizer at University College, London, which recognized 
four vowels and nine consonants based on a similar pattern-recognition prin¬ 
ciple. Fry and Denes’s system was the first to use phoneme transition prob¬ 
abilities to constrain the recognizer. 

The late 1960s and early 1970’s produced a number of important para¬ 
digm shifts. First were a number of feature-extraction algorithms, include 
the efficient Fast Fourier Transform (FFT) (Cooley and Tukey, 1965), the 
application of cepstral processing to speech (Oppenheim et al , 1968), and 
the development of LPC for speech coding (Atal and Hanauer, 1971). Sec¬ 
ond were a number of ways of handling warping; stretching or shrinking warping 
the input signal to handle differences in speaking rate and segment length 
when matching against stored patterns. The natural algorithm for solving 
this problem was dynamic programming, and, as we saw in Chapter 5, the 
algorithm was reinvented multiple times to address this problem. The first 
application to speech processing was by Yintsyuk (1968), although his re¬ 
sult was not picked up by other researchers, and was reinvented by Velichko 
and Zagoruyko (1970) and Sakoe and Chiba (1971) (and (1984)). Soon af¬ 
terwards, Itakura (1975) combined this dynamic programming idea with the 
LPC coefficients that had previously been used only for speech coding. The 
resulting system extracted LPC features for incoming words and used dy¬ 
namic programming to match them against stored LPC templates. 

The third innovation of this period was the rise of the HMM. Hid¬ 
den Markov Models seem to have been applied to speech independently 
at two laboratories around 1972. One application arose from the work of 
statisticians, in particular Baum and colleagues at the Institute for Defense 
Analyses in Princeton on HMMs and their application to various predic¬ 
tion problems (Baum and Petrie, 1966; Baum and Eagon, 1967). James 
Baker learned of this work and applied the algorithm to speech process¬ 
ing (Baker, 1975) during his graduate work at CMU. Independently, Freder¬ 
ick Jelinek, Robert Mercer, and Lalit Bahl (drawing from their research in 
information-theoretical models influenced by the work of Shannon (1948)) 
applied HMMs to speech at the IBM Thomas J. Watson Research Center 
(Jelinek et al , 1975). IBM’s and Baker’s systems were very similar, par¬ 
ticularly in their use of the Bayesian framework described in this chapter. 

One early difference was the decoding algorithm; Baker’s DRAGON system 
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used Viterbi (dynamic programming) decoding, while the IBM system ap¬ 
plied Jelinek’s stack decoding algorithm (Jelinek, 1969). Baker then joined 
the IBM group for a brief time before founding the speech-recognition com¬ 
pany Dragon Systems. The HMM approach to speech recognition would 
turn out to completely dominate the field by the end of the century; indeed 
the IBM lab was the driving force in extending statistical models to natural 
language processing as well, including the development of class-based N- 
grams, HMM-based part-of-speech tagging, statistical machine translation, 
and the use of entropy/perplexity as an evaluation metric. 

The use of the HMM slowly spread through the speech community. 
One cause was a number of research and development programs sponsored 
by the Advanced Research Projects Agency of the U.S. Department of De¬ 
fense (ARPA). The first five-year program starting in 1971, and is reviewed 
in Klatt (1977). The goal of this first program was to build speech under¬ 
standing systems based on a few speakers, a constrained grammar and lexi¬ 
con (1000 words), and less than 10% semantic error rate. Four systems were 
funded and compared against each other: the System Development Corpo¬ 
ration (SDC) system, Bolt, Beranek & Newman (BBN)’s HWIM system, 
Camegie-Mellon University’s Hearsay-II system, and Camegie-Mellon’s Harpy 
system (Lowerre, 1968). The Harpy system used a simplified version of 
Baker’s HMM-based DRAGON system and was the best of the tested sys¬ 
tems, and according to Klatt the only one to meet the original goals of the 
ARPA project (with a semantic error rate of 94% on a simple task). 

Beginning in the mid-80’s, ARPA funded a number of new speech 
research programs. The first was the “Resource Management” (RM) task 
(Price et al , 1988), which like the earlier ARPA task involved transcrip¬ 
tion (recognition) of read-speech (speakers reading sentences constructed 
from a 1000-word vocabulary) but which now included a component that 
involved speaker-independent recognition. Later tasks included recognition 
of sentences read from the Wall Street Journal (WSJ) beginning with limited 
systems of 5,000 words, and finally with systems of unlimited vocabulary 
(in practice most systems use approximately 60,000 words). Later speech- 
recognition tasks moved away from read-speech to more natural domains; 
the Broadcast News (also called Hub-4) domain (LDC, 1998; Graff, 1997) 
(transcription of actual news broadcasts, including quite difficult passages 
such as on-the-street interviews) and the CALLHOME and CALLFRIEND 
domain (LDC, 1999) (natural telephone conversations between friends), part 
of what was also called Hub-5. The Air Traffic Information System (ATIS) 
task (Hemphill et al , 1990) was a speech understanding task whose goal 
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was to simulate helping a user book a flight, by answering questions about 
potential airlines, times, dates, etc. 

Each of the ARPA tasks involved an approximately annual bake-off at bake-off 
which all ARPA-funded systems, and many other ‘volunteer’ systems from 
North American and Europe, were evaluated against each other in terms of 
word error rate or semantic error rate. In the early evaluations, for-profit cor¬ 
porations did not generally compete, but eventually many (especially IBM 
and ATT) competed regularly. The ARPA competitions resulted in widescale 
borrowing of techniques among labs, since it was easy to see which ideas 
had provided an error-reduction the previous year, and were probably an im¬ 
portant factor in the eventual spread of the HMM paradigm to virtual every 
major speech recognition lab. The ARPA program also resulted in a number 
of useful databases, originally designed for training and testing systems for 
each evaluation (TIMIT, RM, WSJ, ATIS, BN, CALLHOME, Switchboard) 
but then made available for general research use. 

There are a number of textbooks on speech recognition that are good 
choices for readers who seek a more in-depth understanding of the material 
in this chapter: Jelinek (1997), Gold and Morgan (1999), and Rabiner and 
Juang (1993) are the most comprehensive. The last two textbooks also have 
comprehensive discussions of the history of the field, and together with the 
survey paper of Levinson (1995) have influenced our short history discussion 
in this chapter. Our description of the forward-backward algorithm was mod¬ 
eled after Rabiner (1989). Another useful tutorial paper is Knill and Young 
(1997). Research in the speech recognition field often appears in the pro¬ 
ceedings of the biennial EUROSPEECH Conference and the International 
Conference on Spoken Language Processing (ICSLP), held in alternating 
years, as well as the annual IEEE International Conference on Acoustics, 

Speech, and Signal Processing (ICASSP). Journals include Speech Com¬ 
munication, Computer Speech and Language, IEEE Transactions on Pattern 
Analysis and Machine Intelligence, and IEEE Transactions on Acoustics, 

Speech, and Signal Processing. 


Exercises 


7.1 Analyze each of the errors in the incorrectly recognized transcription 
of “um the phone is I left the... ” on page 269. For each one, give your best 
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guess as to whether you think it is caused by a problem in signal process¬ 
ing, pronunciation modeling, lexicon size, language model, or pruning in the 
decoding search. 

7.2 In practice, speech recognizers do all their probability computation us- 
logprob ing the log probability (or logprob) rather than actual probabilities. This 

helps avoid underflow for very small probabilities, but also makes the Yiterbi 
algorithm very efficient, since all probability multiplications can be imple¬ 
mented by adding log probabilities. Rewrite the pseudocode for the Viterbi 
algorithm in Figure 7.9 on page 247 to make use of logprobs instead of prob¬ 
abilities. 

7.3 Now modify the Viterbi algorithm in Figure 7.9 on page 247 to im¬ 
plement the beam search described on page 249. Hint: You will probably 
need to add in code to check whether a given state is at the end of a word or 
not. 

7.4 Finally, modify the Yiterbi algorithm in Figure 7.9 on page 247 with 
more detailed pseudocode implementing the array of backtrace pointers. 

7.5 Implement the Stack decoding algorithm of Figure 7.14 on 254. Pick 
a very simple h* function like an estimate of the number of words remaining 
in the sentence. 

7.6 Modify the forward algorithm of Figure 5.16 to use the tree-structured 
lexicon of Figure 7.18 on page 257. 



Part II 

SYNTAX 


If words are the foundation of speech and language processing, syn¬ 
tax is the skeleton. Syntax is the study of formal relationships be¬ 
tween words. These six chapters study how words are clustered into 
classes called parts-of-speech, how they group with their neighbors 
into phrases, and the way words depends on other words in a sentence. 
The section explores computational models of all of these kinds of 
knowledge, including context-free grammars, lexicalized grammars, 
feature structures, and metatheoretical issues like the Chomsky hi¬ 
erarchy. It introduces fundamental algorithms for dealing with this 
knowledge, like the Earley and CYK algorithms for parsing and the 
unification algorithm for feature combination. It also includes proba¬ 
bilistic models of this syntactic knowledge, including HMM part-of- 
speech taggers, and probabilistic context-free grammars. Finally, this 
section will explore psychological models of human syntactic pro¬ 
cessing. 








WORD CLASSES AND 

PART-OF-SPEECH 

TAGGING 


Conjunction Junction, what's your function? 

Bob Dorough, Schoolhouse Rock, 1973 

There are ten parts of speech, and they are all troublesome. 
Mark Twain, The Awful German Language 


The definitions [of the parts of speech] are very far from having 
attained the degree of exactitude found in Euclidean geometry. 
Otto Jespersen, The Philosophy of Grammar, 1924 


Words are traditionally grouped into equivalence classes called parts of 
speech (POS; Latin pars orationis ), word classes, morphological classes, speech f 
or lexical tags. In traditional grammars there were generally only a few parts pos 
of speech (noun, verb, adjective, preposition, adverb, conjunction, etc.), classes 
More recent models have much larger numbers of word classes (45 for the 
Penn Treebank (Marcus et al , 1993), 87 for the Brown corpus (Francis, 

1979; Francis and Kucera, 1982), and 146 for the C7 tagset (Garside et al , 

1997)). 

The part of speech for a word gives a significant amount of information 
about the word and its neighbors. This is clearly true for major categories, 

(verb versus noun), but is also true for the many finer distinctions. For ex¬ 
ample these tagsets distinguish between possessive pronouns (my, your, his, 
her, its) and personal pronouns (/, you, he, me). Knowing whether a word is 
a possessive pronoun or a personal pronoun can tell us what words are likely 
to occur in its vicinity (possessive pronouns are likely to be followed by a 
noun, personal pronouns by a verb). This can be useful in a language model 
for speech recognition. 
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A word’s part-of-speech can tell us something about how the word is 
pronounced. As Chapter 4 discussed, the word content , for example, can be 
a noun or an adjective. They are pronounced differently (the noun is pro¬ 
nounced CONtent and the adjective conTENT). Thus knowing the part of 
speech can produce more natural pronunciations in a speech synthesis sys¬ 
tem and more accuracy in a speech recognition system. (Other pairs like this 
include OBject (noun) and obJECT (verb), Discount (noun) and disCOUNT 
(verb); see Cutler (1986)). 

Parts of speech can also be used in stemming for informational retrieval 
(IR), since knowing a word’s part of speech can help tell us which morpho¬ 
logical affixes it can take, as we saw in Chapter 3. They can also help an 
IR application by helping select out nouns or other important words from a 
document. Automatic part-of-speech taggers can help in building automatic 
word-sense disambiguating algorithms, and POS taggers are also used in ad¬ 
vanced ASR language models such as class-based N-grams, discussed in 
Section 8.7. Parts of speech are very often used for ‘partial parsing’ texts, 
for example for quickly finding names or other phrases for the information 
extraction applications discussed in Chapter 15. Finally, corpora that have 
been marked for part-of-speech are very useful for linguistic research, for 
example to help find instances or frequencies of particular constructions in 
large corpora. 

The remainder of this chapter begins in Section 8.1 with a summary of 
English word classes, followed by a description in Section 8.2 of different 
tagsets for formally coding these classes. The next three sections then in¬ 
troduces three tagging algorithms: rule-based tagging, stochastic tagging, 
and transformation-based tagging. 


8.1 (Mostly) English Word Classes 

Well, every person you can know, 

And every place that you can go, 

And anything that you can show, 

You know they're nouns. 

Lynn Ahrens, Schoolhouse Rock , 1973 


Until now we have been using part-of-speech terms like noun and verb 
rather freely. In this section we give a more complete definition of these 
and other classes. Traditionally the definition of parts of speech has been 
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based on morphological and syntactic function; words that function simi¬ 
larly with respect to the affixes they take (their morphological properties) or 
with respect to what can occur nearby (their ‘distributional properties’) are 
grouped into classes. While word classes do have tendencies toward seman¬ 
tic coherence (nouns do in fact often describe ‘people, places or things’, and 
adjectives often describe properties), this is not necessarily the case, and in 
general we don’t use semantic coherence as a definitional criterion for parts 
of speech. 

Parts of speech can be divided into two broad supercategories: closed 
class types and open class types. Closed classes are those that have relatively claIs d 
fixed membership. For example, prepositions are a closed class because open class 
there is a fixed set of them in English; new prepositions are rarely coined. By 
contrast nouns and verbs are open classes because new nouns and verbs are 
continually coined or borrowed from other languages (e.g. the new verb to 
fax or the borrowed noun futon). It is likely that any given speaker or corpus 
will have different open class words, but all speakers of a language, and 
corpora that are large enough, will likely share the set of closed class words. 

Closed class words are generally also function words; function words are wor c d t s on 
grammatical words like of it , and, or you , which tend to be very short, occur 
frequently, and play an important role in grammar. 

There are four major open classes that occur in the languages of the 
world: nouns, verbs, adjectives, and adverbs. It turns out that English has nouns 
all four of these, although not every language does. Many languages have no verbs 
adjectives. In the native American language Lakhota, for example, and also adjectives 
possibly in Chinese, the words corresponding to English adjectives act as a adverbs 
subclass of verbs. 

Every known human language has at least the two categories noun and 
verb (although in some languages, for example Nootka, the distinction is 
subtle). Noun is the name given to the lexical class in which the words for 
most people, places, or things occur. But since lexical classes like noun are 
defined functionally (morphological and syntactically) rather than seman¬ 
tically, some words for people, places, and things may not be nouns, and 
conversely some nouns may not be words for people, places, or things. Thus 
nouns include concrete terms like ship and chair , abstractions like band¬ 
width and relationship , and verb-like terms like pacing in His pacing to and 
fro became quite annoying). What defines a noun in English, then, are things 
like its ability to occur with determiners (a goat, its bandwidth , Plato's Re¬ 
public ), to take possessives {IBM's annual revenue ), and for most but not all 
nouns, to occur in the plural form (goats, abaci). 
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PROPER 

NOUNS 

COMMON 

NOUNS 


COUNT 

NOUNS 

MASS NOUNS 


AUXILIARIES 


Nouns are traditionally grouped into proper nouns and common nouns. 
Proper nouns, like Regina, Colorado, and IBM, are names of specific persons 
or entities. In English, they generally aren’t preceded by articles (e.g. the 
book is upstairs , but Regina is upstairs). In written English, proper nouns 
are usually capitalized. 

In many languages, including English, common nouns are divided into 
count nouns and mass nouns. Count nouns are those that allow gram¬ 
matical enumeration; that is, they can occur in both the singular and plural 
(goat/goats, relationship/relationships) and they can be counted (one goat , 
two goats). Mass nouns are used when something is conceptualized as a ho¬ 
mogeneous group. So words like snow ; salt , and communism are not counted 
(i.e. *two snows or *two communisms). Mass nouns can also appear without 
articles where singular count nouns cannot (Snow is white but not *Goat is 
white). 

The verb class includes most of the words referring to actions and pro¬ 
cesses, including main verbs like draw, provide, differ , and go. As we saw 
in Chapter 3, English verbs have a number of morphological forms (non- 
3rd-person-sg (eat), 3d-person-sg (eats), progressive (eating), past partici¬ 
ple eaten). A subclass of English verbs called auxiliaries will be discussed 
when we turn to closed class forms. 

The third open class English form is adjectives; semantically this class 
includes many terms that describe properties or qualities. Most languages 
have adjectives for the concepts of color (white, black), age (old, young), 
and value (good, bad), but there are languages without adjectives. As we 
discussed above, many linguists argue that the Chinese family of languages 
uses verbs to describe such English-adjectival notions as color and age. 

The final open class form, adverbs, is rather a hodge-podge, both se¬ 
mantically and formally. For example Schachter (1985) points out that in a 
sentence like the following, all the italicized words are adverbs: 


Unfortunately, John walked home extremely slowly yesterday 


What coherence the class has semantically may be solely that each of 
these words can be viewed as modifying something (often verbs, hence the 
name ‘adverb’, but also other adverbs and entire verb phrases). Directional 
locative adverbs or locative adverbs (home, here, downhill) specify the direction 
degree or location of some action; degree adverbs (extremely, very, somewhat) 

manner specify the extent of some action, process, or property; manner adverbs 

(slowly, slinkily, delicately) describe the manner of some action or process; 
temporal and temporal adverbs describe the time that some action or event took place 
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( yesterday , Monday). Because of the heterogeneous nature of this class, 
some adverbs (for example temporal adverbs like Monday) are tagged in 
some tagging schemes as nouns. 

The closed classes differ more from language to language than do the 
open classes. Here’s a quick overview of some of the more important closed 
classes in English, with a few examples of each: 

• prepositions: on, under, over, near, by, at, from, to, with 

• determiners: a, an, the 

• pronouns: she, who, I, others 

• conjunctions: and, but, or, as, if, when 

• auxiliary verbs: can, may, should, are 

• particles: up, down, on, off, in, out, at, by, 

• numerals: one, two, three, first, second, third 

Prepositions occur before noun phrases; semantically they are rela- tions osi " 
tional, often indicating spatial or temporal relations, whether literal (on it , 
before then , by the house) or metaphorical (on time , with gusto , beside her¬ 
self). But they often indicate other relations as well (Hamlet was written by 
Shakespeare , and (from Shakespeare) i( And I did laugh sans intermission an 
hour by his dial”). Figure 8.1 shows the prepositions of English according 
to the CELEX on-line dictionary (Celex, 1993), sorted by their frequency in 
the COBUILD 16 million word corpus of English (?). Note that this should 
not be considered a definitive list. Different dictionaries and different tag 
sets may label word classes differently. This list combines prepositions and 
particles; see below for more on particles. 

A particle is a word that resembles a preposition or an adverb, and that particle 
often combines with a verb to form a larger unit called a phrasal verb, as in verb sal 
the following examples from Thoreau: 

So I went on for some days cutting and hewing timber... 

Moral reform is the effort to throw off sleep... 

We can see that these are particles rather than prepositions, for in the 
first example, on is followed, not by a noun phrase, but by a true preposition 
phrase. With transitive phrasal verbs, as in the second example, we can tell 
that off is a particle and not a preposition because particles may appear after 
their objects (throw sleep off as well as throw off sleep). This is not possible 
for prepositions (The horse went off its track , but *The horse went its track 

off). 
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of 

540,085 

through 

14,964 

worth 

1,563 

pace 

12 

in 

331,235 

after 

13,670 

toward 

1,390 

nigh 

9 

for 

142,421 

between 

13,275 

plus 

750 

re 

4 

to 

125,691 

under 

9,525 

till 

686 

mid 

3 

with 

124,965 

per 

6,515 

amongst 

525 

o’er 

2 

on 

109,129 

among 

5,090 

via 

351 

but 

0 

at 

100,169 

within 

5,030 

amid 

222 

ere 

0 

by 

77,794 

towards 

4,700 

underneath 

164 

less 

0 

from 

74,843 

above 

3,056 

versus 

113 

midst 

0 

about 

38,428 

near 

2,026 

amidst 

67 

o’ 

0 

than 

20,210 

off 

1,695 

sans 

20 

thru 

0 

over 

18,071 

past 

1,575 

circa 

14 

vice 

0 


Figure 8.1 Prepositions (and particles) of English from the CELEX on-line 
dictionary. Frequency counts are from the COBUILD 16 million word corpus. 


Quirk et al (1985a) gives the following list of single-word particles. 
Since it is extremely hard to automatically distinguish particles from prepo¬ 
sitions, some tag sets (like the one used for CELEX) do not distinguish them, 
and even in corpora that do (like the Penn Treebank) the distinction is very 
difficult to make reliably in an automatic process, so we do not give counts. 


aboard 

aside 

besides 

forward(s) 

opposite 

through 

about 

astray 

between 

home 

out 

throughout 

above 

away 

beyond 

in 

outside 

together 

across 

back 

by 

inside 

over 

under 

ahead 

before 

close 

instead 

overhead 

underneath 

alongside 

behind 

down 

near 

past 

up 

apart 

below 

east, etc 

off 

round 

within 

around 

beneath 

eastward(s),etc 

on 

since 

without 


Figure 8.2 English single-word particles from Quirk et al (1985a) 


articles A particularly small closed class is the articles: English has three: a , 

aiu and the (although this (as in this chapter) and that (as in that page) are 
often included as well). Articles often begin a noun phrase. A and an mark a 
noun phrase as indefinite, while the can mark it as definite. We will discuss 
definiteness in Chapter 18. Articles are quite frequent in English; indeed 
the is the most frequent word in most English corpora. Here are COBUILD 
statistics, again out of 16 million words: 
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the 1,071,676 
a 413,887 
an 59,359 

Conjunctions are used to join two phrases, clauses, or sentences. Co¬ 
ordinating conjunctions like and , or. or but , join two elements of equal sta¬ 
tus. Subordinating conjunctions are used when one of the elements is of 
some sort of embedded status. For example that in 7 thought that you might 
like some milk ’ is a subordinating conjunction that links the main clause / 
thought with the subordinate clause you might like some milk. This clause 
is called subordinate because this entire clause is the ‘content’ of the main 
verb thought. Subordinating conjunctions like that which link a verb to its 
argument in this way are also called complementizers. Chapter 9 and Chap¬ 
ter 11 will discuss complementation in more detail. Table 8.3 lists English 
conjunctions. 


and 

514,946 

yet 

5,040 

considering 

174 

forasmuch as 

0 

that 

134,773 

since 

4,843 

lest 

131 

however 

0 

but 

96,889 

where 

3,952 

albeit 

104 

immediately 

0 

or 

76,563 

nor 

3,078 

providing 

96 

in as far as 

0 

as 

54,608 

once 

2,826 

whereupon 

85 

in so far as 

0 

if 

53,917 

unless 

2,205 

seeing 

63 

inasmuch as 

0 

when 

37,975 

why 

1,333 

directly 

26 

insomuch as 

0 

because 

23,626 

now 

1,290 

ere 

12 

insomuch that 

0 

so 

12,933 

neither 

1,120 

notwithstanding 

3 

like 

0 

before 

10,720 

whenever 

913 

according as 

0 

neither nor 

0 

though 

10,329 

whereas 

867 

as if 

0 

now that 

0 

than 

9,511 

except 

864 

as long as 

0 

only 

0 

while 

8,144 

till 

686 

as though 

0 

provided that 

0 

after 

7,042 

provided 

594 

both and 

0 

providing that 

0 

whether 

5,978 

whilst 

351 

but that 

0 

seeing as 

0 

for 

5,935 

suppose 

281 

but then 

0 

seeing as how 

0 

although 

5,424 

cos 

188 

but then again 

0 

seeing that 

0 

until 

5,072 

supposing 

185 

either or 

0 

without 

0 


Figure 8.3 Coordinating and subordinating conjunctions of English from 
the CELEX on-line dictionary. Frequency counts are from the COBUILD 16 
million word corpus. 


Pronouns are forms that often act as a kind of shorthand for referring 
to some noun phrase or entity or event. Personal pronouns refer to per¬ 
sons or entities (you , she , /, it , me , etc). Possessive pronouns are forms of 
personal pronouns that indicate either actual possession or more often just 


CONJUNC¬ 

TIONS 


COMPLEMEN¬ 

TIZERS 


PRONOUNS 

POSSESSIVE 
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WH 


an abstract relation between the person and some object {my, your, his, her, 
its, one's, our, their). Wh-pronouns {what, who, whom, whoever) are used 
in certain question forms, or may also act as complementizers {Frieda, who 
I met five years ago... ). Table 8.4 shows English pronouns, again from 
CELEX. 


it 

199,920 

how 

13,137 

yourself 

2,437 

no one 

106 

I 

198,139 

another 

12,551 

why 

2,220 

wherein 

58 

he 

158,366 

where 

11,857 

little 

2,089 

double 

39 

you 

128,688 

same 

11,841 

none 

1,992 

thine 

30 

his 

99,820 

something 

11,754 

nobody 

1,684 

summat 

22 

they 

88,416 

each 

11,320 

further 

1,666 

suchlike 

18 

this 

84,927 

both 

10,930 

everybody 

1,474 

fewest 

15 

that 

82,603 

last 

10,816 

ourselves 

1,428 

thyself 

14 

she 

73,966 

every 

9,788 

mine 

1,426 

whomever 

11 

her 

69,004 

himself 

9,113 

somebody 

1,322 

whosoever 

10 

we 

64,846 

nothing 

9,026 

former 

1,177 

whomsoever 

8 

all 

61,767 

when 

8,336 

past 

984 

wherefore 

6 

which 

61,399 

one 

7,423 

plenty 

940 

whereat 

5 

their 

51,922 

much 

7,237 

either 

848 

whatsoever 

4 

what 

50,116 

anything 

6,937 

yours 

826 

whereon 

2 

my 

46,791 

next 

6,047 

neither 

618 

whoso 

2 

him 

45,024 

themselves 

5,990 

fewer 

536 

aught 

1 

me 

43,071 

most 

5,115 

hers 

482 

howsoever 

1 

who 

42,881 

itself 

5,032 

ours 

458 

thrice 

1 

them 

42,099 

myself 

4,819 

whoever 

391 

wheresoever 

1 

no 

33,458 

everything 

4,662 

least 

386 

you-all 

1 

some 

32,863 

several 

4,306 

twice 

382 

additional 

0 

other 

29,391 

less 

4,278 

theirs 

303 

anybody 

0 

your 

28,923 

herself 

4,016 

wherever 

289 

each other 

0 

its 

27,783 

whose 

4,005 

oneself 

239 

once 

0 

our 

23,029 

someone 

3,755 

thou 

229 

one another 

0 

these 

22,697 

certain 

3,345 

’un 

227 

overmuch 

0 

any 

22,666 

anyone 

3,318 

ye 

192 

such and such 

0 

more 

21,873 

whom 

3,229 

thy 

191 

whate’er 

0 

many 

17,343 

enough 

3,197 

whereby 

176 

whenever 

0 

such 

16,880 

half 

3,065 

thee 

166 

whereof 

0 

those 

15,819 

few 

2,933 

yourselves 

148 

whereto 

0 

own 

15,741 

everyone 

2,812 

latter 

142 

whereunto 

0 

us 

15,724 

whatever 

2,571 

whichever 

121 

whichsoever 

0 


Figure 8.4 Pronouns of English from the CELEX on-line dictionary. Fre¬ 
quency counts are from the COBUILD 16 million word corpus. 
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A closed class subtype of English verbs are the auxiliary verbs. Cross- auxiliary 
linguistically, auxiliaries are words (usually verbs) that mark certain seman¬ 
tic features of a main verb, including whether an action takes place in the 
present, past or future (tense), whether it is completed (aspect), whether it is 
negated (polarity), and whether an action is necessary, possible, suggested, 
desired, etc. (mood). 

English auxiliaries include the copula verb be , the two verbs do and copula 
have , along with their inflected forms, as well as a class of modal verbs. Be modal 
is called a copula because it connects subjects with certain kinds of predicate 
nominals and adjectives (He is a duck). The verb have is used for example 
to mark the perfect tenses (I have gone, I had gone), while be is used as part 
of the passive (We were robbed ), or progressive (We are leaving) construc¬ 
tions. The modals are used to mark the mood associated with the event or 
action depicted by the main verb. So can indicates ability or possibility, may 
indicates permission or possibility, must indicates necessity, etc. Figure 8.5 
gives counts for the frequencies of the modals in English. In addition to 
the copula have mentioned above, there is a modal verb have (e.g. I have 
to go), which is very common in spoken English. Neither it nor the modal 
verb dare, which is very rare, have frequency counts because the CELEX 
dictionary does not distinguish the main verb sense (I have three oranges. 

He dared me to eat them), from the modal sense (There has to be some mis¬ 
take, Dare I confront him?) from the non-modal auxiliary verb sense (I have 
never seen that). 



English also has many words of more or less unique function, includ¬ 
ing interjections (oh, ah, hey, man, alas), negatives (no, not), politeness tions jec ' 
markers (please, thank you), greetings (hello, goodbye ), and the existen- negatives 


GREETINGS 
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there tial there (there are two on the table ) among others. Whether these classes 
are assigned particular names or lumped together (as interjections or even 
adverbs) depends on the purpose of the labeling. 


8.2 Tagsets for English 

The previous section gave broad descriptions of the kinds of lexical classes 
that English words fall into. This section fleshes out that sketch by describ¬ 
ing the actual tagsets used in part-of-speech tagging, in preparation for the 
various tagging algorithms to be described in the following sections. 

There are a small number of popular tagsets for English, many of which 
evolved from the 87-tag tagset used for the Brown corpus (Francis, 1979; 
Francis and Kucera, 1982). Three of the most commonly used are the small 
45-tag Penn Treebank tagset (Marcus et al ., 1993), the medium-sized 61 tag 
C5 tagset used by the Lancaster UCREL project’s CLAWS (the Constituent 
Likelihood Automatic Word-tagging System) tagger to tag the British Na¬ 
tional Corpus (BNC) (Garside et al , 1997), and the larger 146-tag C7 tagset 
(Leech et al , 1994); the C5 and C7 tagsets are listed in Appendix C. (Also 
see Sampson (1987) and Garside et al (1997) for a detailed summary of the 
provenance and makeup of these and other tagsets.) This section will present 
the smallest of them, the Penn Treebank set, and then discuss specific addi¬ 
tional tags from some of the other tagsets that might be useful to incorporate 
for specific projects. 

The Penn Treebank tagset, shown in Figure 8.6, has been applied to 
the Brown corpus and a number of other corpora. Here is an example of a 
tagged sentence from the Penn Treebank version of the Brown corpus (in a 
flat ASCII file, tags are often represented after each word, following a slash, 
but tags can also be represented in various other ways): 

The/DT grand/JJ jury/NN commented/VBD on/LN a/DT num- 

ber/NN of/IN other/JJ topics/NNS ./. 

The Penn Treebank tagset was culled from the original 87-tag tagset 
for the Brown corpus. This reduced set leaves out information that can be 
recovered from the identity of the lexical item. For example the original 
Brown tagset and other large tagsets like C5 include a separate tag for each 
of the different forms of the verbs do (e.g. C5 tag ‘VDD’ for did and ‘VDG’ 
for doing), be, and have. These were omitted from the Penn set. 
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Tag 

Description 

Example 

Tag 

Description 

Example 

cc 

Coordin. Conjunction and, but, or 

SYM Symbol 

+,%, Sc 

CD 

Cardinal number 

one, two, three 

TO 

“to” 

to 

DT 

Determiner 

a, the 

UH 

Inteijection 

ah, oops 

EX 

Existential ‘there’ 

there 

YB 

Verb, base form 

eat 

FW 

Foreign word 

mea culpa 

YBD 

Verb, past tense 

ate 

IN 

Preposition/sub-conj 

of in, by 

YBG 

Verb, gerund 

eating 

JJ 

Adjective 

yellow 

YBN Verb, past participle eaten 

JJR 

Adj., comparative 

bigger 

YBP 

Verb, non-3sg pres 

eat 

JJS 

Adj., superlative 

wildest 

YBZ 

Verb, 3sg pres 

eats 

LS 

List item marker 

1, 2, One 

WDT Wh-determiner 

which, that 

MD 

Modal 

can, should 

WP 

Wh-pronoun 

what, who 

NN 

Noun, sing, or mass 

llama 

WP$ 

Possessive wh- 

whose 

NNS 

Noun, plural 

llamas 

WRB Wh-adverb 

how, where 

NNP 

Proper noun, singular 

IBM 

$ 

Dollar sign 

$ 

NNPS Proper noun, plural 

Caro Unas 

# 

Pound sign 

# 

PDT 

Predeterminer 

all, both 


Left quote 

(‘ or “) 

POS 

Possessive ending 

’s 

V> 

Right quote 

(’ or ”) 

PP 

Personal pronoun 

I, you, he 

( 

Left parenthesis 

( t, (, t, o 

pp$ 

Possessive pronoun 

your, one's 

) 

Right parenthesis 

a), >,» 

RB 

Adverb 

quickly, never 

, 

Comma 

, 

RBR 

Adverb, comparative faster 


Sentence-final punc 

(. ! ?) 

RBS 

Adverb, superlative 

fastest 


Mid-sentence punc 

0;...--) 

RP 

Particle 

up, off 




Figure 8.6 Penn Treebank Part-of-Speech Tags (Including Punctuation) 


Certain syntactic distinctions were not marked in the Penn Treebank 
tagset because Treebank sentences were parsed, not merely tagged, and so 
some syntactic information is represented in the phrase structure. For ex¬ 
ample, prepositions and subordinating conjunctions were combined into the 
single tag IN , since the tree-structure of the sentence disambiguated them 
(subordinating conjunctions always precede clauses, prepositions precede 
noun phrases or prepositional phrases). 

Most tagging situations, however, do not involve parsed corpora; for 
this reason the Penn Treebank set is not specific enough for many uses. The 
C7 tagset, for example, also distinguishes prepositions (II) from subordi¬ 
nating conjunctions (CS ), and distinguishes the preposition to (II) from the 
infinite marker to (TO). 

Which tagset to use for a particular application depends, of course, on 
how much information the application needs. The reader should see Ap¬ 
pendix C for a listing of the C5 and Cl tagsets. 
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8.3 Part of Speech Tagging 

tagging Part-of-speech tagging (or just tagging for short) is the process of assigning 
a part-of-speech or other lexical class marker to each word in a corpus. Tags 
are also usually applied to punctuation markers; thus tagging for natural lan¬ 
guage is the same process as tokenization for computer languages, although 
tags for natural languages are much more ambiguous. As we suggested at 
the beginning of the chapter, taggers play an increasingly important role in 
speech recognition, natural language parsing and information retrieval. 

The input to a tagging algorithm is a string of words and a specified 
tagset tagset of the kind described in the previous section. The output is a single 
best tag for each word. For example, here are some sample sentences from 
the ATIS corpus of dialogues about air-travel reservations that we will dis¬ 
cuss in Chapter 9. For each we have shown a potential tagged output using 
the Penn Treebank tagset defined in Figure 8.6 on page 295: 

VB dt nn . 

Book that flight. 

VBZ DT NN VB NN ? 

Does that flight serve dinner ? 


Even in these simple examples, automatically assigning a tag to each 
ambiguous word is not trivial. For example, book is ambiguous. That is, it has more 
than one possible usage and part of speech. It can be a verb (as in book that 
flight or to book the suspect) or a noun (as in hand me that book , or a book 
of matches). Similarly that can be a determiner (as in Does that flight serve 
dinner ), or a complementizer (as in / thought that your flight was earlier). 
resolve The problem of POS-tagging is to resolve these ambiguities, choosing the 
proper tag for the context. Part-of-speech tagging is thus one of the many 
disambiguation tasks we will see in this book. 

How hard is the tagging problem? Most words in English are unam¬ 
biguous; i.e. they have only a single tag. But many of the most common 
words of English are ambiguous (for example can can be an auxiliary (‘to 
be able’), a noun (‘a metal container’), or a verb (‘to put something in such 
a metal container’)). In fact DeRose (1988) reports that while only 11.5% 
of English word types in the Brown Corpus are ambiguous, over 40% of 
Brown tokens are ambiguous. Based on Francis and Kucera (1982), he gives 
the table of tag ambiguity in Figure 8.7. 
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Unambiguous (1 tag) 

35,340 

Ambiguous (2-7 tags) 

4,100 

2 tags 

3,760 

3 tags 

264 

4 tags 

61 

5 tags 

12 

6 tags 

2 

7 tags 

1 (“still”) 

Figure 8.7 The number of word types in Brown corpus by degree of ambi¬ 
guity (after DeRose (1988)). 


Luckily, it turns out that many of the 40% ambiguous tokens are easy 
to disambiguate. This is because the various tags associated with a word 
are not equally likely. For example, a can be a determiner, or the letter a 
(perhaps as part of an acronym or an initial). But the determiner sense of a 
is much more likely. 

Most tagging algorithms fall into one of two classes: rule-based tag¬ 
gers and stochastic taggers. Rule-based taggers generally involve a large 
database of hand-written disambiguation rule which specify, for example, 
that an ambiguous word is a noun rather than a verb if it follows a de¬ 
terminer. The next section will describe a sample rule-based tagger, EN- 
GTWOL, based on the Constraint Grammar architecture of Karlsson et al. 

(1995). 

Stochastic taggers generally resolve tagging ambiguities by using a 
training corpus to compute the probability of a given word having a given 
tag in a given context. Section 8.5 describes a stochastic tagger called HMM 
tagger, also called a Maximum Likelihood Tagger, or a Markov model hmm tagger 
tagger, based on the Hidden Markov Model presented in Chapter 7. 

Finally, Section 8.6 will describe an approach to tagging called the 
transformation-based tagger or the Brill tagger, after Brill (1995). The tagger 
B rill tagger shares features of both tagging architectures. Like the rule-based 
tagger, it is based on rules which determine when an ambiguous word should 
have a given tag. Like the stochastic taggers, it has a machine-learning com¬ 
ponent: the rules are automatically induced from a previously-tagged train¬ 
ing corpus. 
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8.4 Rule-based Part-of-speech Tagging 


The earliest algorithms for automatically assigning part-of-speech were based 
on a two-stage architecture (Harris, 1962; Klein and Simmons, 1963; Greene 
and Rubin, 1971). The first stage used a dictionary to assign each word a list 
of potential parts of speech. The second stage used large lists of hand-written 
disambiguation rules to winnow down this list to a single part-of-speech for 
each word. 

engtwol The ENGTWOL tagger (Voutilainen, 1995) is based on the same two- 

stage architecture, although both the lexicon and the disambiguation rules 
are much more sophisticated than the early algorithms. The ENGTWOL 
lexicon is based on the two-level morphology described in Chapter 3, and 
has about 56,000 entries for English word stems (Heikkila, 1995), counting 
a word with multiple parts of speech (e.g. nominal and verbal senses of hit ) 
as separate entries, and of course not counting inflected and many derived 
forms. Each entry is annotated with a set of morphological and syntactic 
features. Figure 8.8 shows some selected words, together with a slightly 
simplified listing of their features. 


Word 

POS 

Additional POS features 

smaller 

ADJ 

COMPARATIVE 

entire 

ADJ 

ABSOLUTE ATTRIBUTIVE 

fast 

ADV 

SUPERLATIVE 

that 

DET 

CENTRAL DEMONSTRATIVE SG 

all 

DET 

PREDETERMINER SGI PL QUANTIFIER 

dog’s 

N 

GENITIVE SG 

furniture 

N 

NOMINATIVE SG NOINDEFDETERMINER 

one-third 

NUM 

SG 

she 

PRON 

PERSONAL FEMININE NOMINATIVE SG3 

show 

V 

IMPERATIVE VEIN 

show 

V 

PRESENT -SG3 VEIN 

show 

N 

NOMINATIVE SG 

shown 

PCP2 

SVOO SVO SV 

occurred 

PCP2 

SV 

occurred 

V 

PAST VEIN S V 

Figure 8.8 

Sample lexical entries from the ENGTWOL lexicon described 

in Voutilainen (1995) and Heikkila (1995). 


Most of the features in Figure 8.8 are relatively self-explanatory; SG 
for singular, -SG3 for other than third-person-singular. ABSOLUTE means 
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non-comparative and non-superlative for an adjective, NOMINATIVE just 
means non-genitive, and PCP2 means past participle. PRE, CENTRAL, 
and POST are ordering slots for determiners (predeterminers (all) come be¬ 
fore determiners (the): all the president's men). NOENDEFDETERMENER 
means that words like furniture do not appear with the indefinite determiner 
a. SV, SVO, and SVOO specify the subcategorization or complementa- ration 60 
tion pattern for the verb. Subcategorization will be discussed in Chapter 9 wion lemen 
and Chapter 11, but briefly SV means the verb appears solely with a subject 
(nothing occurred ); SVO with a subject and an object (I showed the film); 

SVOO with a subject and two complements: She showed her the ball. 

In the first stage of the tagger, each word is run through the two-level 
lexicon transducer and the entries for all possible parts of speech are re¬ 
turned. For example the phrase Pavlov had shown that salivation. .. would 
return the following list (one line per possible tag, with the correct tag shown 
in boldface): 


Pavlov PAVLOV N NOM SG PROPER 
had HAVE V PAST VFIN SVO 

HAVE PCP2 SVO 

shown SHOW PCP2 SVOO SVO SV 

that ADV 

PRON DEM SG 

DET CENTRAL DEM SG 

CS 

salivation N NOM SG 


A set of about 1,100 constraints are then applied to the input sentence 
to rule out incorrect parts of speech; the boldfaced entries in the table above 
show the desired result, in which the preterite (not participle) tag is applied 
to had , and the complementizer (CS) tag is applied the that. The constraints 
are used in a negative way, to eliminate tags that are inconsistent with the 
context. For example one constraint eliminates all readings of that except 
the ADV (adverbial intensifier) sense (this is the sense in the sentence it isn't 
that odd). Here’s a simplified version of the constraint: 
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Adverbial-that rule 

Given input: ’’that” 
if 

(+1 A/ADV/QUANT); / * if next word is adj, adverb, or quantifier * / 

(+2 SENT-LIM); / * and following which is a sentence boundary, * / 

(NOT -1 SYOC/A); / * and the previous word is not a verb like * / 

/ * ’ consider’ which allows adjs as object complements * / 
then eliminate non-ADV tags 
else eliminate ADV tag 

The first two clauses of this rule check to see that the that directly 
precedes a sentence-final adjective, adverb, or quantifier. In all other cases 
the adverb reading is eliminated. The last clause eliminates cases preceded 
by verbs like consider or believe which can take a noun and an adjective; this 
is to avoid tagging the following instance of that as an adverb: 

I consider that odd. 

Another rule is used to express the constraint that the complementizer 
sense of that is most likely to be used if the previous word is a verb which ex¬ 
pects a complement (like believe , think , or show), and if the that is followed 
by the beginning of a noun phrase, and a finite verb. 

This description oversimplifies the ENGTWOL architecture; the sys¬ 
tem also includes probabilistic constraints, and also makes use of other syn¬ 
tactic information we haven’t discussed. The interested reader should con¬ 
sult Karlsson et al (1995). 


8.5 Stochastic Part-of-speech Tagging 


The use of probabilities in tags is quite old; probabilities in tagging were 
first used by (Stolz et al , 1965), a complete probabilistic tagger with Yiterbi 
decoding was sketched by Bahl and Mercer (1976), and various stochastic 
taggers were built in the 1980’s (Marshall, 1983; Garside, 1987; Church, 
1988; DeRose, 1988). This section describes a particular stochastic tagging 
algorithm generally known as the Hidden Markov Model or HMM tagger. 
The intuition behind all stochastic taggers is a simple generalization of the 
‘pick the most-likely tag for this word’ approach that we discussed above, 
based on the Bayesian framework we saw in Chapter 5. 

For a given sentence or word sequence, HMM taggers choose the tag 
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sequence that maximizes the following formula: 

P(word|tag) * P(tag|previous n tags) (8.1) 

The rest of this section will explain and motivate this particular equa¬ 
tion. HMM taggers generally choose a tag sequence for a whole sentence 
rather than for a single word, but for pedagogical purposes, let’s first see 
how an HMM tagger assigns a tag to an individual word. We first give the 
basic equation, then work through an example, and, finally, give the motiva¬ 
tion for the equation. 

A bigram-HMM tagger of this kind chooses the tag ti for word w; that 
is most probable given the previous tag U-\ and the current word wf. 

ti = argmax P(tj\ U- 1 , vt 7 /) (8.2) 

j 

Through some simplifying Markov assumptions that we will give below, we 
restate Equation 8.2 to give the basic HMM equation for a single tag as 
follows: 

U = argmax P(tj\ti-i)P(wi\tj) (8.3) 

j 

A Motivating Example 

Let’s work through an example, using an HMM tagger to assign the proper 
tag to the single word race in the following examples (both shortened slightly 
from the Brown corpus): 

(8.4) Secretariat/NNP is/VBZ expected/VBN to/TO race/VB 
tomorrow/NN 

(8.5) People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN 
for/IN the/DT race/NN for/LN outer/JJ space/NN 

In the first example race is a verb (VB), in the second a noun (NN). 
For the purposes of this example, let’s pretend that some other mech¬ 
anism has already done the best tagging job possible on the surrounding 
words, leaving only the word race untagged. A bigram version of the HMM 
tagger makes the simplifying assumption that the tagging problem can be 
solved by looking at nearby words and tags. Consider the problem of as¬ 
signing a tag to race given just these subsequences: 

to/TO race/??? 
the/DT race/??? 

Let’s see how this equation applies to our example with race; Equa¬ 
tion 8.3 says that if we are trying to choose between NN and VB for the 
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sequence to race , we choose the tag that has the greater of these two proba¬ 
bilities: 

P(YB|TO)P(race|YB) (8.6) 

and 

P(NN|TO)P(race|NN) (8.7) 

Equation 8.3 and its instantiations Equations 8.6 and 8.7 each have 
two probabilities: a tag sequence probability P(^|6-i) and a word-likelihood 
P(wi\tj). For race , the tag sequence probabilities P(NN|TO) and P(VB|TO) 
give us the answer to the question “how likely are we to expect a verb (noun) 
given the previous tag?”. They can just be computed from a corpus by count¬ 
ing and normalizing. We would expect that a verb is more likely to follow 
TO than a noun is, since infinitives (to race, to run, to eat ) are common in 
English. While it is possible for a noun to follow TO (walk to school, related 
to hunting ), it is less common. 

Sure enough, a look at the combined Brown and Switchboard corpora 
gives us the following probabilities, showing that verbs are fifteen times as 
likely as nouns after TO: 

P(NN|TO) = .021 

P(VB|TO) = .34 

The second part of Equation 8.3 and its instantiations Equations 8.6 
and 8.7 is the lexical likelihood: the likelihood of the noun race given each 
tag, P(race| YB) and P(race|NN). Note that this likelihood term is not asking 
‘which is the most likely tag for this word’. That is, the likelihood term is not 
P(VB|race). Instead we are computing P(race|VB). The probability, slightly 
counterintuitively, answers the question “if we were expecting a verb, how 
likely is it that this verb would be race”. 

Here are the lexical likelihoods from the combined Brown and Switch¬ 
board corpora: 


P(race|NN) = .00041 
P(race|VB) = .00003 

If we multiply the lexical likelihoods with the tag sequence probabili¬ 
ties, we see that even the simple bigram version of the HMM tagger correctly 
tags race as a YB despite the fact that it is the less likely sense of race : 
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P(VB | TO)P(race | VB) = .00001 
P(NN|TO)P(race|NN) = .000007 


The Actual Algorithm for HMM tagging 

Recall that a real HMM tagger isn’t just choosing the best tag for a simple 
word but the best sequence of tags for an entire sentence. Now that we’ve 
seen the single-word intuition for the HMM tagger, let’s give the complete 
equations. Generally, we make the Viterbi approximation and choose the 
most probable tag sequence for each sentence. This approach thus assumes 
that we are trying to compute for each sentence the most probable sequence 
of tags T = given the sequence of words in the sentence (W): 

T = argmaxP(r|TT) 

Tex 

By Bayes Law, P(T\W) can be expressed as: 

ppm = rmnwm 

v 1 } P{W) 

Thus we are attempting to choose the sequence of tags that maximizes 


T argmax 

Tex 


P(T)P(W\T) 

P(W) 


Since we are looking for the most likely tag sequence for a sentence 
given a particular word sequence, the probability of the word sequence P{W) 
will be the same for each tag sequence and we can ignore it. 

f = argmaxP(r)P(W|r) 

Tex 

From the chain rule of probability: 


P{T)P(W\T) -ii« 

i= 1 

We make the TV-gram assumption again, just as we did in Equation 6.9 
for modeling the probability of word sequences. The trigram model is most 
often used, so let’s define that one. First, we make the simplifying assump¬ 
tion that the probability of a word is dependent only its tag: 


P(Wi\W]t] ... Vi';_ , , tj ) = p(Wi\ti) 
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Next, we make the assumption that the tag history can be approximated 
by the most recent two tags: 

P(ti\W\t\ ...Wi-iti-i) = 

Thus we are choosing the tag sequence that maximizes: 

1) ^(^2 kl ) ri 1 fi - 2^-1 ) O w, I )] 

i =3 i= 1 

As usual, we can use maximum likelihood estimation from relative 
frequencies to estimate these probabilities. 


P{ti\ti-2ti-\) 


P(Wt\t t ) 


c(Wj,ti) 

c{n) 


This model can also be smoothed (for example by the backoff or deleted 
interpolation algorithms of Chapter 6) to avoid zero probabilities. 

Finding the most probable tag sequence can be done with the Viterbi 
algorithm described in Chapter 7. 

Weischedel et al (1993) and DeRose (1988) have reported accuracies 
of above 96% for this algorithm. 

The HMM tagger we have seen so far is trained on hand-tagged data. 
Kupiec (1992), Cutting et al (1992a), and others show that it is also possible 
to train an HMM tagger on unlabeled data, using the EM algorithm of Chap¬ 
ter 7 and Appendix D. These taggers still start with a dictionary which lists 
which tags can be assigned to which words; the EM algorithm then learns 
the word likelihood function for each tag, and the tag transition probabili¬ 
ties. An experiment by Merialdo (1994), however, indicates that with even a 
small amount of training data, a tagger trained on hand-tagged data worked 
better than one trained via EM. Thus the EM-trained ‘pure HMM’ tagger is 
probably best suited in cases where no training data is available, for example 
when tagging languages for which there is no previously hand-tagged data. 


8.6 Transformation-Based Tagging 

Transformation-Based Tagging, sometimes called Brill tagging, is an in- 
BASED F0RMATI0N stance of the Transformation-Based Learning (TBL) approach to machine 

LEARNING & 11 

learning (Brill, 1995), and draws inspiration from both the rule-based and 
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Methodology Box: Evaluating Taggers 

Taggers arc often evaluating by comparing them with a human- 
labeled Gold Standard test set, based on percent correct: the per¬ 
centage of all tags in the test set where the tagger and the Gold 
standard agree. Most current lagging algorithms have an accuracy 
(percent-correct) of around 96% to 97% for simple tagsets like the 
Penn Treebank set; human annotators can then be used to manually 
post-process the tagged corpus. 

How good is 96%? Since tag sets and tasks differ, the perfor¬ 
mance of tags can be compared against a lower-bound baseline and 
an upper-bound ceiling. One way to set a ceiling is to see how well 
humans do on the task. Marcus et al. (1993), for example, found that 
human annotators agreed on about 96-97% of the tags in the Penn 
Treebank version of the Brown Corpus. This suggests that the Gold 
Standard may have a 3-4% margin of error, and that it is not possi¬ 
ble to get 100% accuracy. Two experiments by Voutilainen (1995, 
p, 174), however, found that if humans were allowed to discuss the 
tags, they reached consensus on 100% of the tags. 

Key Concept #6. Human Ceiling: When using a human 
Gold Standard to evaluate a classification algorithm, check the 
agreement rate of humans on the standard. 

The standard baseline, suggested by Gale et al. (1992) (in 
the slightly different context of word-sense disambiguation), is to 
choose the unigram most-likely tag for each ambiguous word. The 
mosl-likely tag for each word can be computed from a hand-tagged 
corpus (which may be the same as the training corpus for the tagger 
being evaluated). 

Key Concept #7. Unigram Baseline: When designing a new 
classification algorithm, always compare it against the unigram 
baseline (assigning each token to the class it occurred in most 
often in the training set). 

Chamiak et al. (1993) showed that a (slightly smoothed) version 
of this baseline algorithm achieves an accuracy of 90-91%! Tagging 
algorithms since Harris (1962) have incorporated this intuition about 
tag-frequency. 
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stochastic taggers. Like the rule-based taggers, TBL is based on rules that 
specify what tags should be assigned to what words. But like the stochastic 
taggers, TBL is a machine learning technique, in which rules are automati¬ 
cally induced from the data. Like some but not all of the HMM taggers, TBL 
is a supervised learning technique; it assumes a pre-tagged training corpus. 

Samuel et al (1998a) offer a useful analogy for understanding the TBL 
paradigm, which they credit to Terry Harvey. Imagine an artist painting a 
picture of a white house with green trim against a blue sky. Suppose most of 
the picture was sky, and hence most of the picture was blue. The artist might 
begin by using a very broad brush and painting the entire canvas blue. Next 
she might switch to a somewhat smaller white brush, and paint the entire 
house white. She would just color in the whole house, not worrying about 
the brown roof, or the blue windows or the green gables. Next she takes a 
smaller brown brush and colors over the roof. Now she takes up the blue 
paint on a small brush and paints in the blue windows on the bam. Finally 
she takes a very fine green brush and does the trim on the gables. 

The painter starts with a broad bmsh that covers a lot of the canvas 
but colors a lot of areas that will have to be repainted. The next layer col¬ 
ors less of the canvas, but also makes less ‘mistakes’. Each new layer uses 
a finer bmsh that corrects less of the picture, but makes fewer mistakes. 
TBL uses somewhat the same method as this painter. The TBL algorithm 
has a set of tagging rules. A corpus is first tagged using the broadest rule, 
i.e. the one that applies to the most cases. Then a slightly more specific 
rule is chosen, which changes some of the original tags. Next an even nar¬ 
rower rule, which changes a smaller number of tags (some of which might 
be previously-changed tags). 

How TBL rules are applied 

Let’s look at one of the rules used by Brill’s (1995) tagger. Before the rules 
apply, the tagger labels every word with its most-likely tag. We get these 
most-likely tags from a tagged corpus. For example, in the Brown corpus, 
race is most likely to be a noun: 


P(NN|race) = .98 
P(VB|race) = .02 

This means that the two examples of race that we saw above will both 
be coded as NN. In the first case, this is a mistake, as NN is the incorrect 
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tag: 

(8.8) is/VBZ expected/VBN to/TO race/NN tomorrow/NN 

In the second case this race is correctly tagged as an NN: 

(8.9) the/DT race/NN for/IN outer/JJ space/NN 

After selecting the most-likely tag, Brill’s tagger applies its transfor¬ 
mation rules. As it happens, Brill’s tagger learned a rule that applies exactly 
to this mistagging of race : 

Change NN to VB when the previous tag is TO 

This rule would change race/NN to race/VB in exactly the following 
situation, since it is preceded by to/TO : 

(8.10) expected/VBN to/TO race/NN —> expected/VBN to/TO race/VB 

How TBL Rules are Learned 

Brill’s TBL algorithm has three major stages. It first labels every word with 
its most-likely tag. It then examines every possible transformation, and se¬ 
lects the one that results in the most improved tagging. Finally, it then re-tags 
the data according to this rule. These three stages are repeated until some 
stopping criterion is reached, such as insufficient improvement over the pre¬ 
vious pass. Note that stage two requires that TBL knows the correct tag of 
each word; i.e., TBL is a supervised learning algorithm. 

The output of the TBL process is an ordered list of transformations; 
these then constitute a ‘tagging procedure’ that can be applied to a new cor¬ 
pus. In principle the set of possible transformations is infinite, since we 
could imagine transformations such as “transform NN to VB if the previous 
word was ‘IBM’ and the word ‘the’ occurs between 17 and 158 words before 
that”. But TBL needs to consider every possible transformation, in order to 
pick the best one on each pass through the algorithm. Thus the algorithm 
needs a way to limit the set of transformations. This is done by designing 
a small set of templates, abstracted transformations. Every allowable trans- templates 
formation is an instantiation of one of the templates. Brill’s set of templates 
is listed in Figure 8.9. Figure 8.10 gives the details of this algorithm for 
learning transformations. 

At the heart of Figure 8.10 are the two functions Get_Best_Trans- 
formation and Get_Best_Instance. Get_Best_Transformation 
is called with a list of potential templates; for each template, it calls 
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The preceding (following) word is tagged z. 

The word two before (after) is tagged z. 

One of the two preceding (following) words is tagged z. 

One of the three preceding (following) words is tagged z. 

The preceding word is tagged z and the following word is tagged w. 

The preceding (following) word is tagged z and the word 
_ two before (after) is tagged w, _ 

Figure 8.9 Brill’s (1995) templates. Each begins with ‘Change tag a to tag 
b when: \ The variables a, b, z, and w range over parts of speech. 


Get_Best_Instance. Get_Best_Instance iteratively tests every pos¬ 
sible instantiation of each template by filling in specific values for the tag 
variables a, b, z and w. 

In practice, there are a number of ways to make the algorithm more 
efficient. For example, templates and instantiated transformations can be 
suggested in a data-driven manner; a transformation-instance might only be 
suggested if it would improve the tagging of some specific word. The search 
can also be made more efficient by pre-indexing the words in the training 
corpus by potential transformation. Roche and Schabes (1997a) show how 
the tagger can also be speeded up by converting each rule into a finite-state 
transducer and composing all the transducers. 

Figure 8.11 shows a few of the rules learned by BrilTs original tagger. 


8.7 Other Issues 

Multiple tags and multiple words 

Two issues that arise in tagging are tag indeterminacy and multi-part words. 
Tag indeterminacy arises when a word is ambiguous between multiple tags 
and it is impossible or very difficult to disambiguate. In this case, some 
taggers allow the use of multiple tags. This is the case in the Penn Treebank 
and in the British National Corpus. Common tag indeterminacies include ad¬ 
jective versus preterite versus past participle (JJ/VBD/VBN), and adjective 
versus noun as prenominal modifier (JJ/NN). 

The second issue concerns multi-part words. The C5 and Cl tagsets, 
for example, allow prepositions like ‘in terms of to be treated as a single 
word by adding numbers to each tag: 
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function TBL {corpus) returns transforms-queue 
lNTIALIZE-WITH-M0ST-LIKELY-TAGS(C0r/7M5) 
until end condition is met do 

temp/ato^GENERATE-POTENTIAL-RELEVANT-TEMPLATES 
best-transform Get-B EST-TRANSFORM(ccrpM^, templates ) 
ApPLY-TRANSFORM(6m-tams/hrm, corpus) 

En queue (best-transform-rule, transforms-queue ) 

end 

return (transforms-queue) 


function Get-Best-Transform (corpus, templates) returns transform 
for each template in templates 
(instance, score) <- GET-BEST-lNSTANCE(c<9rpws, template) 
if (score > best-transform.score) then best-transform <—(instance, score) 
return (best-transform) 


function G et- B est-Instan CE(corpus, template) returns transform 
for from-tag <-from tag— 1 to tag—n do 
for to-tag from tag— 1 to tag—n do 
for pos<- from 1 to corpus-size do 
if (correct-tag(pos) == to-tag 8eSe current-tag(pos) ==from-tag) 
num-good-transforms(current-tag(pos—l))-\-+ 
elseif (correct-tag(pos)==from-tag Me current-tag(pos)==from-tag) 
num-bad-transforms(current-tag(pos—l))++ 

end 

best-Z ARGMAXt (num-good-transforms(t) - num-bad-transforms(t)) 
\f(num-good-transforms(best-Z) - num-bad-transforms(best-Z) 

> best-instance.Z) then 

best-instance “Change tag from from-tag to to-tag 
if previous tag is best-Z ’ 

return (best-instance) 


procedure ApPLY-TRANSFORM(tams/hrra, corpus) 
for pos i — from 1 to corpus-size do 
if (current-tag(pos)==best-rule-from) 

Me (current-tag(pos— l)==best-rule-prev)) 
current-tag(pos) - best-rule-to 


Figure 8.10 The TBL algorithm for learning to tag. Get_Best Jnstance 
would have to change for transformations templates other than ‘Change tag 
from XtoY if previous tag is Z\ After Brill (1995). 
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HAPAX 

LEGOMENA 



Change tags 



# 

From 

To 

Condition 

Example 

1 

NN 

YB 

Previous tag is TO 

to/TO race/NN YB 

2 

VBP 

YB 

One of the previous 3 tags is MD 

might/MD vanish/YBP —> YB 

3 

NN 

YB 

One of the previous 2 tags is MD 

might/MD not reply/NN —> YB 

4 

VB 

NN 

One of the previous 2 tags is DT 


5 

YBD 

YBN 

One of the previous 3 tags is YBZ 


Figure 8.11 

The first 20 nonlexicalized transformations from the Brill tag- 

ger (Brill, 1995). 



in/II31 terms/1132 of/II33 

Finally, some tagged corpora split certain words; for example the Penn 
Treebank and the British National Corpus splits contractions and the \s- 
genitive from their stems: 

would/MD n’t/RB 
children/NNS ’s/POS 


Unknown words 

All the tagging algorithms we have discussed require a dictionary that lists 
the possible parts of speech of every word. But the largest dictionary will 
still not contain every possible word, as we saw in Chapter 4. Proper names 
and acronyms are created very often, and even new common nouns and verbs 
enter the language at a surprising rate. Therefore in order to build a complete 
tagger we need some method for guessing the tag of an unknown word. 

The simplest possible unknown-word algorithm is to pretend that each 
unknown word is ambiguous among all possible tags, with equal probability. 
Then the tagger must rely solely on the contextual POS-trigrams to suggest 
the proper tag. A slightly more complex algorithm is based on the idea that 
the probability distribution of tags over unknown words is very similar to the 
distribution of tags over words that occurred only once in a training set. an 
idea that was suggested by both Baayen and Sproat (1996) and Dermatas and 
Kokkinakis (1995). These words that only occur once are known as hapax 
legomena (singular hapax legomenon). For example, unknown words and 
hapax legomena are similar in that they are both most likely to be nouns, 
followed by verbs, but are very unlikely to be determiners or interjections. 
Thus the likelihood P(wi\ti) for an unknown word is determined by the av¬ 
erage of the distribution over all singleton words in the training set. (Recall 
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Methodology Box: Error Analysis 


In order to improve a computational model wc need to analyze 
and understand where it went wrong. Analyzing the error in a pattern 
classifier like a parLof-speech tagger is usually done via a confusion 
matrix, also called a contingency table. A confusion matrix for an 
N-way classification task is an N-by-N matrix where the cell (a.} 1 ) 
contains the number of times an item with correction classification 
x was classified by the model as >\ For example, the following ta¬ 
bic shows a portion of the confusion matrix from the HMM lagging 
experiments of Franz (19%). The row labels indicate correct tags, 
column labels indicate the tagger’s hypothesized tags, and each cell 
indicates percentage of the overall Lagging error. Thus 4.4% of the 
total errors were caused by mislagging a VBD as a VBN. Common 
errors are boldfaced in the table. 



IN 

JJ 

NN 

NNP 

RB 

VBD 

VBN 

IN 

- 

.2 



.7 



JJ 

.2 

- 

3.3 

2.1 

1.7 

.2 

2.7 

NN 


8.7 

- 




.2 

NNP 

.2 

3.3 

4.1 

- 

.2 



RB 

2.2 

2.0 

,5 


- 



VBD 


.3 

,5 



- 

4.4 

VBN 


2.8 




2.6 

- 


The confusion matrix above, and related error analyses in Franz 
(1996), Kupiec (1992), and Ralnaparkhi (1996), suggest that some 
major problems facing current taggers are: 

1. NN versus NNP versus JJ: These are hard to distinguish 
prenominally. Distinguishing proper nouns is especially im¬ 
portant for information extraction and machine translation. 

2. RP versus RB versus IN: All of these can appear in sequences 
of satellites immediately following the verb. 

3. VBD versus VBN versus JJ: Distinguishing these is impor¬ 
tant for partial parsing (participles are used to find passives), 
and for correctly labeling the edges of noun-phrases. 
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that this idea of using Things we’ve seen once’ as an estimator for Things 
we’ve never seen’ proved useful as key concept Things Seen Once in the 
Witten-Bell and Good-Turing algorithms of Chapter 6). 

The most powerful unknown-word algorithms make use of information 
about how the word is spelled. For example, words that end in the letter -s are 
likely to be plural nouns (NNS), while words ending with -ed tend to be past 
participles (VBN). Words starting with capital letters are likely to be nouns. 
Weischedel et al. (1993) used four specific kinds of orthographic features: 3 
inflectional endings (- ed, -s, -ing ), 32 derivational endings (such as -ion, -al, 
-ive, and -ly), 4 values of capitalization (capitalized initial, capitalized non¬ 
initial, etc.), and hyphenation. They used the following equation to compute 
the likelihood of an unknown word: 

P(wi\ti) = /?(unknown-word|^) */? (capital^;) * /?(endings/hyph|f;) 

Other researchers, rather than relying on these hand-designed features, 
have used machine learning to induce useful features. Brill (1995) used the 
TBL algorithm, where the allowable templates were defined orthographi- 
cally (the first N letters of the words, the last N letters of the word, etc). 
His algorithm induced all the English inflectional features, hyphenation, and 
many derivational features such as -ly, al. Franz (1996) uses a loglinear 
model which includes more features, such as the length of the word and 
various prefixes, and furthermore includes interaction terms among various 
features. 

Class-based N-grams 

Now that we have a way of automatically assigning a class to each word in 
a corpus, we can use this information to augment our TV-gram models. The 
n L gram BASED class-based N-gram is a variant of the TV-gram that uses the frequency of 
sequences of POS (or other) classes to help produce a more knowledgeable 
estimate of the probability of word strings. The basic class-based TV-gram 
defines the conditional probability of a word w n based on its history as the 
product of the two factors: the probability of the class given the preceding 
classes (based on a Af-gram-of-classes), and the probability of a particular 
word given the class: 

P(w n |<:t+ 1 ) =P(w„ I c n )p(c n IJr +1 ) 

The maximum likelihood estimate (MLE) of the probability of the 
word given the class and the probability of the class given the previous class 
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Methodology Box: Computing Agreement via k 


One problem with the percent correct metric for evaluating 
taggers is that it doesn't control for how easy the tagging task is. 
If 99% of the tags arc, say, NN, then getting 99% correct isn’t very 
good; wc could have gotten 99% correct just by guessing NN. This 
means that it’s really impossible to compare taggers which are being 
run on different test sets or different tasks. As the previous method¬ 
ology box noted, one factor that can help normalize different values 
of percent correct is to measure the difficulty of a given task via die 
unigram baseline for that task. 

In fact, there is an evaluation statistic called kappa (tc) that takes 
this baseline into account, inherently controlling for the complex¬ 
ity of the task (Siegel and Castellan, 1988; Carletta, 1996). Kappa 
can be used instead of percent correct when comparing a tagger to 
a Gold Standard, or especially when comparing human labelers to 
each other, when there is no one correct answer. Kappa is the ratio of 
the proportion of times that 2 classifiers agree (corrected for chance 
agreement) to the maximum proportion of times that the classifiers 
could agree (corrected for chance agreement): 

_ P(A)-P(E) 

1 — P(E) 

P(A) is the proportion of times that the hypothesis agrees with the 
standard; i.e., percent correct. P(E) is the proportion of times that the 
hypothesis and the standard would be expected to agree by chance. 
P(E) can be computed from some other knowledge, or it can be com¬ 
puted from the actual confusion matrix for the labels being com¬ 
pared. The bounds for k are just like those for percent correct; when 
there is no agreement (other than what would be expected by chance) 
k = 0. When there is complete agreement, K = 1. 

The k statistic is most often used when there is no ‘Gold Stan¬ 
dard' at all. This occurs, for example, when comparing human label¬ 
ers to each other on a difficult subjective task. In this case, k is a very 
useful evaluation metric, the ‘average pairwise agreement corrected 
for chance agreement’. Krippemiorf (1980) suggests that a value of 
k > .8 can be considered good reliability. 
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can be computed as follows: 


P ( w \ c ) = 


C ( w ) 

C ( c ) 


P(Ci\Ci-l ) 


C{cj-\Cj) 

E c C(c/_ic) 


A class-based A-gram can rely on standard tagsets like the Penn tagset 
to define the classes, or on application-specific sets (for example using tags 
like CITY and AIRLINE for an airline information system). The classes 
can also be automatically induced by clustering words in a corpus (Brown 
et al ., 1992). A number of researchers have shown that class-based A-grams 
can be useful in decreasing the perplexity and word-error rate of language 
models, especially if they are mixed in some way with regular word-based 
A-grams (Jelinek, 1990; Kneser and Ney, 1993; Heeman, 1999; Samuelsson 
andReichl, 1999). 


8.8 Summary 

This chapter introduced the idea of parts-of-speech and part-of-speech tag¬ 
ging. The main ideas: 

• Languages generally have a relatively small set of closed class words, 
which are often highly frequent, generally act as function words, and 
can be very ambiguous in their part-of-speech tags. Open class words 
generally include various kinds of nouns, verbs, adjectives. There 
are a number of part-of-speech coding schemes, based on tagsets of 
between 40 and 200 tags. 

• Part-of-speech tagging is the process of assigning a part-of-speech 
label to each of a sequence of words. Taggers can be characterized as 
rule-based or stochastic. Rule-based taggers use hand-written rules to 
distinguish tag ambiguity. Stochastic taggers are either HMM-based, 
choosing the tag sequence which maximizes the product of word like¬ 
lihood and tag sequence probability, or cue-based, using decision trees 
or maximum entropy models to combine probabilistic features. 

• Taggers are often evaluated by comparing their output from a test-set 
to human labels for that test set. Error analysis can help pinpoint areas 
where a tagger doesn’t perform well. 
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Bibliographical and Historical Notes 

The earliest implemented part-of-speech assignment algorithm may have 
been part of the parser in Zellig Harris’s Transformations and Discourse 
Analysis Project (TDAP), which was implemented between June 1958 and 
July 1959 at the University of Pennsylvania (Harris, 1962). Previous nat¬ 
ural language processing systems had used dictionaries with part-of-speech 
information for words, but have not been described as performing part-of- 
speech disambiguation. As part of its parsing, TDAP did part of speech 
disambiguation via 14 hand-written rules, whose use of part-of-speech tag 
sequences prefigures all the modem algorithms, and which were run in an 
order based on the relative frequency of tags for a word. The parser/tagger 
was reimplemented recently and is described by Joshi and Hopely (1999) 
and Karttunen (1999), who note that the parser was essentially implemented 
(ironically in a very modem way) as a cascade of finite-state transducers. 

Soon after the TDAP parser was the Computational Grammar Coder 
(CGC) of Klein and Simmons (1963). The CGC had three components: a 
lexicon, a morphological analyzer, and a context disambiguator. The small 
1500-word lexicon included exceptional words that could not be accounted 
for in the simple morphological analyzer, including function words as well as 
irregular nouns, verbs, and adjectives. The morphological analyzer used in¬ 
flectional and derivational suffixes to assign part-of-speech classes. A word 
was run through the lexicon and morphological analyzer to produce a can¬ 
didate set of parts-of-speech. A set of 500 context rules were then used to 
disambiguate this candidate set, by relying on surrounding islands of unam¬ 
biguous words. For example, one rule said that between an ARTICLE and a 
VERB, the only allowable sequences were ADJ-NOUN, NOUN-ADVERB, 
or NOUN-NOUN. The CGC algorithm reported 90% accuracy on applying 
a 30-tag tagset to articles from the Scientific American and a children’s en¬ 
cyclopedia. 

The taggit tagger (Greene and Rubin, 1971) was based on the Klein 
and Simmons (1963) system, using the same architecture but increasing the 
size of the dictionary and the size of the tagset (to 87 tags). For example the 
following sample rule, which states that a word x is unlikely to be a plural 
noun (NNS) before a third person singular verb (VBZ): 

x VBZ not NNS 

Taggit was applied to the Brown Corpus and, according to Francis 
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and Kucera (1982, p. 9), “resulted in the accurate tagging of 77% of the 
corpus” (the remainder of the Brown Corpus was tagged by hand). 

In the 1970’s the Lancaster-Oslo/Bergen (LOB) Corpus was compiled 
as a British English equivalent of the Brown Corpus. It was tagged with 
the CLAWS tagger (Marshall, 1983, 1987; Garside, 1987), a probabilis¬ 
tic algorithm which can be viewed as an approximation to the HMM tag¬ 
ging approach. The algorithm used tag bigram probabilities, but instead 
of storing the word-likelihood of each tag, tags were marked either as rare 
(P(tag|word) < .01) infrequent (P(tag|word) < .10), or normally frequent 
(P(tagjword) > .10), 

The probabilistic parts tagger of Church (1988) was very close to a 
full HMM tagger. It extended the CLAWS idea to assign full lexical prob¬ 
abilities to each word/tag combination, and used Yiterbi decoding to find a 
tag sequence. Like the CLAWS tagger, however, it stored the probability of 
the tag given the word: 

P(tag|word) * P(tag| previous n tags) (8.11) 

rather than using the probability of the word given the tag, as an HMM tagger 
does: 


P(word|tag) * P(tag| previous n tags) (8.12) 

Later taggers explicitly introduced the use of the Hidden Markov Model, 
often with the EM training algorithm (Kupiec, 1992; Merialdo, 1994; Wei- 
schedel et al , 1993), including the use of variable length Markov models 
(Schiitze and Singer, 1994). 

A number of recent stochastic algorithms use various statistical and 
machine-learning tools to estimate the probability of a tag or tag-sequence 
given a large number of relevant features such as the neighboring words and 
neighboring parts of speech, as well as assorted orthographic and morpho¬ 
logical features. These features are then combined to estimate the probability 
of tag either via a decision tree (Jelinek et al , 1994; Magerman, 1995), the 
Maximum Entropy algorithm (Ratnaparkhi, 1996), log-linear models (Franz, 
1996), or networks of linear separators (SNOW) (Roth and Zelenko, 1998). 
Brill (1997) presents a unsupervised version of the TBL algorithm. 
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Exercises 

8.1 Find one tagging error in each of the following sentences that are 
tagged with the Penn Treebank tagset: 

a. I/PRP need/VBP a/DT flight/NN from/EN Atlanta/NN 

b. Does/VBZ this/DT flight/NN serve/VB dinner/NNS 

c. I/PRP have/VB a/DT friend/NN living/VBG in/LN Denver/NNP 

d. What/WDT flights/NNS do/VBP you/PRP have/VB from/IN Milwau- 
kee/NNP to/IN Tampa/NNP 

e. Can/VBP you/PRP list/VB the/DT nonstop/JJ aftemoon/NN flights/NNS 

8.2 Use the Penn Treebank tagset to tag each word in the following sen¬ 
tences from Damon Runyon’s short stories. You may ignore punctuation. 
Some of these are quite difficult; do your best. 

a. It is a nice night. 

b. This crap game is over a garage in Fifty-second Street... 

c. ... Nobody ever takes the newspapers she sells ... 

d. He is a tall, skinny guy with a long, sad, mean-looking kisser, and a 
mournful voice. 

e. ... I am sitting in Mindy’s restaurant putting on the gefillte fish, which 
is a dish I am very fond of, ... 

f. When a guy and a doll get to taking peeks back and forth at each other, 
why there you are indeed. 

8.3 Now compare your tags from Exercise 1 with one or two friend’s an¬ 
swers. On which words did you disagree the most? Why? 

8.4 Implement the Kappa algorithm of page 313, and compute the agree¬ 
ment between you and your friends. To compute P(E) and P(E), you may 
used the following equations modified from Walker et al (1997). These as¬ 
sume that you have the confusion matrix M, where the correct answers label 
the rows and the hypotheses label the columns (as seen in the Methodology 
Box on page 311): 
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where ti is the sum of the counts in row i of M, and T is the sum of the all 
the counts in M, 

8.5 Now tag the sentences in Exercise 8.2 using the more detailed Cl tagset 
in Appendix C. 

8.6 Implement the TBL algorithm in Figure 8.10. Create a small number 
of templates and train the tagger on any POS-tagged training set you can 
find. 


8.7 Recall that the Church (1988) tagger is not an HMM tagger since it 
incorporates the probability of the tag given the word: 

P(tag|word) * P(tag| previous n tags) (8.13) 

rather than using the likelihood of the word given the tag, as an HMM 
tagger does: 

P(word|tag) * P(tag|previous n tags) (8.14) 

As a gedanken-experiment, construct a sentence, a set of tag transi¬ 
tion probabilities, and a set of lexical tag probabilities that demonstrate a 
way in which the HMM tagger can produce a better answer than the Church 
tagger. 

8.8 Build an HMM tagger. This requires (1) that you have implemented the 
Yiterbi algorithm from Chapter 5 or Chapter 7, (2) that you have a dictionary 
with part-of-speech information and (3) that you have either (a) a part-of- 
speech-tagged corpus or (b) an implementation of the Forward Backward 
algorithm. If you have a labeled corpus, train the transition and observation 
probabilities of an HMM tagger directly on the hand-tagged data. If you 
have an unlabeled corpus, train using Forward Backward. 

8.9 Now run your algorithm on a small test set that you have hand-labeled. 
Find five errors and analyze them. 




CONTEXT-FREE 
GRAMMARS FOR 
ENGLISH 


^Sentence^ 

NP VP 

/\ 

the man Verb NP 


took the book 

The first context-free grammar parse tree (Chomsky, 1956) 


If on a winter's night a traveler by Italo Calvino 
Nuclear and Radiochemistry by Gerhart Friedlander et al. 

The Fire Next Time by James Baldwin 

A Tad Overweight but Violet Eyes to Die For by G. B. Trudeau 
Sometimes a Great Notion by Ken Kesey 
Dancer from the Dance by Andrew Holleran 

6 books in English whose titles are not constituents, 
from Pullum (1991, p. 195) 


In her essay The Anatomy of a Recipe , M. F. K. Fisher (1968) wryly com¬ 
ments that it is “modish” to refer to the anatomy of a thing or problem. The 
similar use of grammar to describe the structures of an area of knowledge 
had a vogue in the 19th century (e.g. Busby’s (1818) A Grammar of Mu¬ 
sic and Field’s (1888) A Grammar of Colouring). In recent years the word 
grammar has made a reappearance, although usually now it is the grammar 
rather than a grammar that is being described (e.g. The Grammar of Graph¬ 
ics, The Grammar of Conducting). Perhaps scholars are simply less modest 
than they used to be? Or perhaps the word grammar itself has changed a 
bit, from ‘a listing of principles or structures’, to ‘those principles or struc- 




320 


Chapter 9. Context-Free Grammars for English 


SYNTAX 


CON¬ 

STITUENT 


tures as an field of inquiry’. Following this second reading, in this chapter 
we turn to what might be called The Grammar of Grammar, or perhaps The 
Grammar of Syntax. 

The word syntax comes from the Greek sfntaxis, meaning ‘setting 
out together or arrangement’, and refers to the way words are arranged to¬ 
gether. We have seen various syntactic notions in previous chapters. Chap¬ 
ter 8 talked about part-of-speech categories as a kind of equivalence class for 
words. Chapter 6 talked about the importance of modeling word order. This 
chapter and the following ones introduce a number of more complex no¬ 
tions of syntax and grammar. There are three main new ideas: constituency, 
grammatical relations, and subcategorization and dependencies. 

The fundamental idea of constituency is that groups of words may be¬ 
have as a single unit or phrase, called a constituent. For example we will 
see that a group of words called a noun phrase often acts as a unit; noun 
phrases include single words like she or Michael and phrases like the house , 
Russian Hill , and a well-weathered three-story structure. This chapter will 
introduce the use of context-free grammars, a formalism that will allow us 
to model these constituency facts. 

Grammatical relations are a formalization of ideas from traditional 
grammar about subjects and objects. In the sentence: 

(9.1) She ate a mammoth breakfast. 

the noun phrase She is the subject and a mammoth breakfast is the object. 
Grammatical relations will be introduced in this chapter when we talk about 
syntactic agreement, and will be expanded upon in Chapter 11. 

Subcategorization and dependency relations refer to certain kinds 
of relations between words and phrases. For example the verb want can be 
followed by an infinitive, as in I want to fly to Detroit , or a noun phrase, as in 
I want a flight to Detroit. But the verb find cannot be followed by an infinitive 
found to fly to Dallas). These are called facts about the subcategory of the 
verb, which will be discussed starting on page 337, and again in Chapter 11. 

All of these kinds of syntactic knowledge can be modeled by various 
kinds of grammars that are based on context-free grammars. Context-free 
grammars are thus the backbone of many models of the syntax of natu¬ 
ral language (and, for that matter, of computer languages). As such they 
are integral to most models of natural language understanding, of grammar 
checking, and more recently of speech understanding. They are powerful 
enough to express sophisticated relations among the words in a sentence, yet 
computationally tractable enough that efficient algorithms exist for parsing 
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sentences with them (as we will see in Chapter 10). Later in Chapter 12 we 
will introduce probabilistic versions of context-free grammars, which model 
many aspects of human sentence processing and which provide sophisticated 
language models for speech recognition. 

In addition to an introduction to the grammar formalism, this chapter 
also provides an overview of the grammar of English. We will be modeling 
example sentences from the Air Traffic Information System (ATIS) domain 
(Hemphill et a/., 1990). ATIS systems are spoken language systems that 
can help book airline reservations. Users try to book flights by conversing 
with the system, specifying constraints like Vd like to fly from Atlanta to 
Denver. The government funded a number of different research sites across 
the country to build ATIS systems in the early 90’s, and so a lot of data was 
collected and a significant amount of research has been done on the resulting 
data. The sentences we will be modeling in this chapter are the user queries 
to the system. 

9.1 Constituency 

How do words group together in English? How do we know they are re¬ 
ally grouping together? Let’s consider the standard grouping that is usually 
called the noun phrase or sometimes the noun group. This is a sequence phrase 
of words surrounding at least one noun. Here are some examples of noun noun group 
phrases (thanks to Damon Runyon): 

three parties from Brooklyn 
a high-class spot such as Mindy’s 
the Broadway coppers 
they 

Harry the Horse 

the reason he comes into the Hot Box 

How do we know that these words group together (or ‘form a con¬ 
stituent’)? One piece of evidence is that they can all appear in similar syn¬ 
tactic environments, for example before a verb. 

three parties from Brooklyn arrive... 
a high-class spot such as Mindy’s attracts... 
the Broadway coppers love... 
they sit 
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But while the whole noun phrase can occur before a verb, this is not 
true of each of the individual words that make up a noun phrase. The follow¬ 
ing are not grammatical sentences of English (recall that we use an asterisk 
(*) to mark fragments that are not grammatical English sentences): 

*from arrive... 

*as attracts... 

*the is... 

*spot is... 

Thus in order to correctly describe facts about the ordering of these 
words in English, we must be able to say things like “Noun Phrases can 
occur before verbs”. 

Other kinds of evidence for constituency come from what are called 
preposed preposed or postposed constructions. For example, the prepositional phrase 
postposed on September seventeenth can be placed in a number of different locations in 
the following examples, including preposed at the beginning, and postposed 
at the end: 

On September seventeenth, I’d like to fly from Atlanta to Denver 

I’d like to fly on September seventeenth from Atlanta to Denver 

I’d like to fly from Atlanta to Denver on September seventeenth 


But again, while the entire phrase can be placed differently, the indi¬ 
vidual words making up the phrase cannot be: 

*On September, I’d like to fly seventeenth from Atlanta to Denver 

* On I’d like to fly September seventeenth from Atlanta to Denver 

*Fd like to fly on September from Atlanta to Denver seventeenth 

Section 9.11 will give other motivations for context-free grammars 
based on their ability to model recursive structures. 

There are many other kinds of evidence that groups of words often 
behave as a single constituent (see Radford (1988) for a good survey). 

9.2 Context-Free Rules and Trees 


CFG 


The most commonly used mathematical system for modeling constituent 
structure in English and other natural languages is the Context-Free Gram¬ 
mar, or CFG. Context-free grammars are also called Phrase-Structure 
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Grammars, and the formalism is equivalent to what is also called Backus- 
Naur Form or BNF. The idea of basing a grammar on constituent structure 
dates back to the psychologist Wilhelm Wundt (1900), but was not formal¬ 
ized until Chomsky (1956), and, independently, Backus (1959). 

A context-free grammar consists of a set of rules or productions, each 
of which expresses the ways that symbols of the language can be grouped 
and ordered together, and a lexicon of words and symbols. For example, 
the following productions expresses that a NP (or noun phrase), can be 
composed of either a ProperNoun or of a determiner (Det) followed by a 
Nominal ; a Nominal can be one or more Nouns. 


NP -»• 

Det Nominal 

(9.2) 

NP -»• 

ProperNoun 

(9.3) 

Nominal —> 

Noun Noun Nominal 

(9.4) 


Context free rules can be hierarchically embedded, so we could com¬ 
bine the previous rule with others like these which express facts about the 
lexicon: 


Dot —^ a (9.5) 

Det the (9.6) 

Noun flight (9.7) 

The symbols that are used in a CFG are divided into two classes. The 
symbols that correspond to words in the language (‘the’, ‘nightclub’) are 
called terminal symbols; the lexicon is the set of rules that introduce these 
terminal symbols. The symbols that express clusters or generalizations of 
these are called nonterminals. In each context-free rule, the item to the right 
of the arrow (—>) is an ordered list of one or more terminals and nontermi¬ 
nals, while to the left of the arrow is a single nonterminal symbol expressing 
some cluster or generalization. Notice that in the lexicon, the nonterminal 
associated with each word is its lexical category, or part-of-speech, which 
we defined in Chapter 8. 

A CFG is usually thought of in two ways: as a device for generating 
sentences, or as a device for assigning a structure to a given sentence. As a 
generator, we could read the —> arrow as ‘rewrite the symbol on the left with 
the string of symbols on the right’. So starting from the symbol 


RULES 

LEXICON 

NP 


TERMINAL 

NONTERMI¬ 

NAL 


NP, 
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we can use rule 9.2 to rewrite NP as 
Det Nominal, 
and then rule 9.4: 

DetNoun, 

and finally via rules 9.5 and 9.7 as 
a flight, 

derived We say the string a flight can be derived from the nonterminal NP. 

Thus a CFG can be used to randomly generate a series of strings. This 
derivation sequence of rule expansions is called a derivation of the string of words. 
parse tree It is common to represent a derivation by a parse tree (commonly shown 

inverted with the root at the top). Here is the tree representation of this 
derivation: 



START 

SYMBOL 


VERB 

PHRASE 


The formal language defined by a CFG is the set of strings that are 
derivable from the designated start symbol. Each grammar must have one 
designated start symbol, which is often called S. Since context-free gram¬ 
mars are often used to define sentences, S is usually interpreted as the ‘sen¬ 
tence’ node, and the set of strings that are derivable from S is the set of 
sentences in some simplified version of English. 

Let’s add to our sample grammar a couple of higher-level rules that 
expand S, and a couple others. One will express the fact that a sentence can 
consist of a noun phrase and a verb phrase: 


S —> NP VP I prefer a morning flight 

A verb phrase in English consists of a verb followed by assorted other 
things; for example, one kind of verb phrase consists of a verb followed by 
a noun phrase: 
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VP Verb NP prefer a morning flight 
Or the verb phrase may have a noun phrase and a prepositional phrase: 

VP —> Verb NP PP leave Boston in the morning 
Or the verb may be followed just by a preposition-phrase: 

VP Verb PP leaving on Thursday 

A prepositional phrase generally has a preposition followed by a noun 
phrase. For example, a very common type of prepositional phrase in the 
ATIS corpus is used to indicate location or direction: 

PP Preposition NP from Los Angeles 

The NP inside a PP need not be a location; PPs are often used with 
times and dates, and with other nouns as well; they can be arbitrarily com¬ 
plex. Here are ten examples from the ATIS corpus: 

to Seattle on these flights 

in Minneapolis about the ground transportation in Chicago 

on Wednesday of the round trip flight on United Airlines 

in the evening of the AP fifty seven flight 

on the ninth of July with a stopover in Nashville 

Figure 9.2 gives a sample lexicon and Figure 9.3 summarizes the gram¬ 
mar rules we’ve seen so far, which we’ll call £q. Note that we can use the 
or-symbol | to indicate that a non-terminal has alternate possible expansions. 

We can use this grammar to generate sentences of this ATIS-language’. 

We start with S, expand it to NP VP , then choose a random expansion of NP 
(let’s say to 7), and a random expansion of VP (let’s say to Verb NP ), and so 
on until we generate the string I prefer a morning flight. Figure 9.4 shows a 
parse tree that represents a complete derivation of I prefer a morning flight. 

It is sometimes convenient to represent a parse tree in a more compact 
format called bracketed notation, essentially the same as LISP tree repre- notion 
sentation; here is the bracketed representation of the parse tree of Figure 9.4: 

L s \np \pro Cl I vp tv prefer] [np I o e t a] I Nom In morning] [ N flight]]]]] 

A CFG like that of Lq defines a formal language. We saw in Chapter 2 
that a formal language is a set of strings. Sentences (strings of words) that 
can be derived by a grammar are in the formal language defined by that 
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Noun —> flights \ breeze \ trip \ morning \ ... 
Verb —> is \ prefer \ like \ need \ want \ fly 
Adjective —> cheapest \ non —stop \ first \ latest 
| other | direct \ ... 

Pronoun —> me \ I \ you \ it \ ... 

Proper-Noun —► Alaska \ Baltimore \ Los Angeles 

| Chicago | United \ American \ ... 
Determiner —> the \ a \ an \ this \ these \ that \ ... 
Preposition -¥ from \ to \ on \ near \ ... 
Conjunction —> and \ or \ but \ ... 


Figure 9.2 The lexicon for Lq. 


s 


NPVP 

I + want a morning flight 

NP 


Pronoun 

I 


| 

Proper-Noun 

Los Angeles 



Det Nominal 

a + flight 

Nominal 


Noun Nominal 

morning + flight 


1 

Noun 

flights 

VP 


Verb 

do 


| 

Verb NP 

want + a flight 



Verb NP PP 

leave + Boston + in the morning 


1 

VerbPP 

leaving + on Thursday 

PP 


Preposition NP 

from + Los Angeles 


Figure 9.3 The grammar for To, with example phrases for each rule. 


grammar, and are called grammatical sentences. Sentences that cannot be 
derived by a given formal grammar are not in the language defined by that 
grammar, and are referred to as ungrammatical. This hard line between 
‘in’ and ‘out’ characterizes all formal languages but is only a very simplified 
model of how natural languages really work. This is because determining 
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whether a given sentence is part of a given natural language (say English) 
often depends on the context. In linguistics, the use of formal languages to 
model natural languages is called generative grammar, since the language grammar^ 
is defined by the set of possible sentences ‘generated’ by the grammar. 

We conclude this section by way of summary with a quick formal de¬ 
scription of a context free grammar and the language it generates. A context- 
free grammar has four parameters (technically ‘is a 4-tuple’): 

1. a set of non-terminal symbols (or ‘variables’) N 

2. a set of terminal symbols £ (disjoint from N ) 

3. a set of productions P, each of the form A —> a, where A is a non¬ 
terminal and a is a string of symbols from the infinite set of strings 
(LUA)*. 

4. a designated start symbol S 

A language is defined via the concept of derivation. One string de¬ 
rives another one if it can be rewritten as the second one via some series of 
rule applications. More formally, following Hopcroft and Ullman (1979), if 
A —y (3 is a production of P and a and y are any strings in the set (LU A)*, 
then we say that aAy directly derives a(3y, or aAy =^> a(3y. Derivation is derves y 
then a generalization of direct derivation. Let GCi, 0C2, ..., a m be strings in 
(£U A)*,m > 1, such that 

cci => a 2 ,a 2 => a 3 ,...,a m _i => a m (9.8) 

We say that ai derives ot^, or cti => a m . 


DERIVES 
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We can then formally define the language L G generated by a grammar 
G as the set of strings composed of terminal symbols which can be derived 
from the designed start symbol S. 

Lg = W\w is in E* and S w (9.9) 

The problem of mapping from a string of words to its parse tree is 
parsing called parsing; we will define algorithms for parsing in Chapter 10 and in 
Chapter 12. 

9.3 Sentence-Level Constructions 

The remainder of this chapter will introduce a few of the more complex as¬ 
pects of the phrase structure of English; for consistency we will continue 
to focus on sentences from the ATIS domain. Because of space limitations, 
our discussion will necessarily be limited to highlights. Readers are strongly 
advised to consult Quirk et al (1985a), which is by far the best current ref¬ 
erence grammar of English. 

In the small grammar £o, we only gave a single sentence-level con¬ 
struction for declarative sentences like I prefer a morning flight. There are 
a great number of possible overall sentence structures, but 4 are particularly 
common and important: declarative structure, imperative structure, yes-no- 
question structure, and wh-question structure, 
declarative Sentences with declarative structure have a subject noun phrase fol¬ 

lowed by a verb phrase, like T prefer a morning flight’. Sentences with this 
structure have a great number of different uses that we will follow up on in 
Chapter 19. Here are a number of examples from the ATIS domain: 

The flight should be eleven a.m tomorrow 

I need a flight to Seattle leaving from Baltimore making a stop in Min¬ 
neapolis 

The return flight should leave at around seven p.m 
I would like to find out the flight number for the United flight that ar¬ 
rives in San Jose around ten p.m 
I’d like to fly the coach discount class 
I want a flight from Ontario to Chicago 
I plan to leave on July first around six thirty in the evening 


IMPERATIVE 


Sentences with imperative structure often begin with a verb phrase, 
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and have no subject. They are called imperative because they are almost 
always used for commands and suggestions; in the ATIS domain they are 
commands to the system. 

Show the lowest fare 

Show me the cheapest fare that has lunch 

Give me Sunday’s flights arriving in Las Vegas from Memphis and 
New York City 

List all flights between five and seven p.m 
List all flights from Burbank to Denver 

Show me all flights that depart before ten a.m and have first class fares 

Show me all the flights leaving Baltimore 

Show me flights arriving within thirty minutes of each other 

Please list the flights from Charlotte to Long Beach arriving after lunch 

time 

Show me the last flight to leave 

To model this kind of sentence structure, we can add another rule for the 
expansion of S : 

S —> VP Show the lowest fare 

Sentences with yes-no-question structure are often (though not al¬ 
ways) used to ask questions (hence the name), and begin with a auxiliary 
verb, followed by a subject NP, followed by a VP. Here are some exam¬ 
ples (note that the third example is not really a question but a command or 
suggestion; Chapter 19 will discuss the pragmatic uses of these question 
forms): 

Do any of these flights have stops? 

Does American’s flight eighteen twenty five serve dinner? 

Can you give me the same information for United? 

Here’s the rule: 

S -► Aux NP VP 

The most complex of the sentence-level structures we will examine 
are the various wh- structures. These are so named because one of their 
constituents is a wh- phrase, i.e. one that includes a wh- word (who, where, 
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WH-NON- 

SUBJECT- 

QUESTION 


9.4 


HEAD 


what, which, how, why). These may be broadly grouped into two classes of 
sentence-level structures. The wh-subject-question structure is identical to 
the declarative structure, except that the first noun phrase contains some wh- 
word. 


What airlines fly from Burbank to Denver? 

Which flights depart Burbank after noon and arrive in Denver by six 
p.m? 

Which flights serve breakfast? 

Which of these flights have the longest layover in Nashville? 


Here is a rule. Exercise 9.10 discusses rules for the constituents that make 
up the Wh-NP. 

S Wh-NP VP 

In the wh-non-subject-question structure, the wh-phrase is not the 
subject of the sentence, and so the sentence includes another subject. In 
these types of sentences the auxiliary appears before the subject NP , just as 
in the yes-no-question structures. Here is an example: 

What flights do you have from Burbank to Tacoma Washington? 


Here is a sample rule: 

S Wh-NP Awe NP VP 

There are other sentence-level structures we won’t try to model here, 
like fronting, in which a phrase is placed at the beginning of the sentence for 
various discourse purposes (for example often involving topicalization and 
focus): 

On Tuesday, I’d like to fly from Detroit to Saint Petersburg 


The Noun Phrase 


We can view the noun phrase as revolving around a head, the central noun 
in the noun phrase. The syntax of English allows for both prenominal (pre¬ 
head) modifiers and post-nominal (post-head) modifiers. 
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Before the Head Noun 

We have already discussed some of the parts of the noun phrase; the deter¬ 
miner, and the use of the Nominal constituent for representing double noun 
phrases. We have seen that noun phrases can begin with a determiner, as 
follows: 

a stop 
the flights 
that fare 
this flight 
those flights 
any flights 
some flights 

There are certain circumstances under which determiners are optional 
in English. For example, determiners may be omitted if the noun they modify 
is plural: 

Show mo flights from San Francisco to Denver on weekdays 

As we saw in Chapter 8, mass nouns don’t require determination. Re¬ 
call that mass nouns often (not always) involve something that is treated like 
a substance (including e.g. water and snow), don’t take the indefinite article 
‘ a\ and don’t tend to pluralize. Many abstract nouns are mass nouns (music, 
homework). Mass nouns in the ATIS domain include breakfast, lunch, and 
dinner. 

Does this flight serve dinner? 

Exercise 9.4 asks the reader to represent this fact in the CFG formalism. 

Word classes that appear in the NP before the determiner are called 
predeterminers. Many of these have to do with number or amount; a com- m [ in e e c r e s er 
mon predeterminer is all : 

all the flights 
all flights 

A number of different kinds of word classes can appear in the NP be¬ 
tween the determiner and the head noun (the ‘post-determiners’). These 
include cardinal numbers, ordinal numbers, and quantifiers. Examples numbers 
of cardinal numbers: numbers 


QUANTIFIERS 



332 


Chapter 9. Context-Free Grammars for English 


two friends 
one stop 

Ordinal numbers include first, second, third, etc, but also words like 
next, last, past, other, and another: 

the first one 

the next day 

the second leg 

the last flight 

the other American flight 

any other fares 

Some quantifiers (many, (a) few, several) occur only with plural count nouns: 
many fares 

The quantifiers much and a little occur only with noncount nouns. 

Adjectives occur after quantifiers but before nouns. 

a first-class fare 
a nonstop flight 
the longest layover 
the earliest lunch flight 

phrase ive Adjectives can also be grouped into a phrase called an adjective phrase 

ap or AP. APs can have an adverb before the adjective (see Chapter 8 for defi¬ 

nitions of adjectives and adverbs): 

the least expensive fare 

We can combine all the options for prenominal modifiers with one rule as 
follows: 

NP —> (Det) (Card) (Ord) (Quant) (AP) Nominal (9.10) 

This simplified noun phrase rule has a flatter structure and hence is 
simpler than most modem theories of grammar. We present this simplified 
rule because there is no universally agreed-upon internal constituency for the 
noun phrase. 

Note the use of parentheses () to mark optional constituents. A mle 
with one set of parentheses is really a shorthand for two rules, one with the 
parentheses, one without. 
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After the Noun 

A head noun can be followed by postmodifiers. Three kinds of nominal 
postmodifiers are very common in English: 

prepositional phrases all flights from Cleveland 
non-finite clauses any flights arriving after eleven a.m. 
relative clauses a flight that serves breakfast 

Prepositional phrase postmodifiers are particularly common in the ATIS 
corpus, since they are used to mark the origin and destination of flights. Here 
are some examples, with brackets inserted to show the boundaries of each 
PP; note that more than one PP can be strung together: 

any stopovers [for Delta seven fifty one] 
all flights [from Cleveland] [to Newark] 
arrival [in San Jose] [before seven p.m] 
a reservation [on flight six oh six] [from Tampa[ [to Montreal] 

Here’s a new NP rule to account for one to three PP postmodifiers: 

Nominal —> Nominal PP (PP) (PP) 

The three most common kinds of non-finite postmodifiers are the gerun¬ 
dive (- ing ), - ed , and infinitive forms. 

Gerundive postmodifiers are so-called because they consist of a verb 
phrase that begins with the gerundive (-ing) form of the verb. In the follow¬ 
ing examples, the verb phrases happen to all have only prepositional phrases 
after the verb, but in general this verb phrase can have anything in it (any¬ 
thing, that is, which is semantically and syntactically compatible with the 
gerund verb). 

any of those (leaving on Thursday) 

any flights (arriving after eleven a.m) 

flights (arriving within thirty minutes of each other) 

We can define the NP as follows, making use of a new nonterminal GerundVP : 


NON-FINITE 

GERUNDIVE 


Nominal —> Nominal GerundVP 
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We can make rules for GerundVP constituents by duplicating all of our YP 
productions, substituting GerundV for V. 

GerundVP GerundV NP 
| GerundV PP 
| GerundV 
| GerundV NPPP 

GerundV can then be defined as: 

GerundV being | prefering | arriving \ leaving | ... 


The phrases in italics below are examples of the two other common kinds of 
non-finite clauses, infinitives and -ed forms: 

the last flight to arrive in Boston 

I need to have dinner served 

Which is the aircraft used by this fiighfl 

A postnominal relative clause (more correctly a restrictive relative 
clause), is a clause that often begins with a relative pronoun ( that and who 
are the most common). The relative pronoun functions as the subject of the 
embedded verb in the following examples: 

a flight that serves breakfast 
flights that leave in the morning 

the United flight that arrives in San Jose around ten p.m. 
the one that leaves at ten thirty five 

We might add rules like the following to deal with these: 

Nominal —> Nominal RelClause (9.11) 

RelClause (who \ that) VP (9.12) 

(9.13) 

The relative pronoun may also function as the object of the embedded 
verb, as in the following example; we leave as an exercise for the reader 
writing grammar rules for more complex relative clauses of this kind. 
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the earliest American Airlines flight that I can get 

Various postnominal modifiers can be combined, as the following examples 
show: 


a flight (from Phoenix to Detroit) (leaving Monday evening) 

I need a flight (to Seattle) (leaving from Baltimore) (making a 
stop in Minneapolis) 

evening flights (from Nashville to Houston) (that serve dinner) 
a friend (living in Denver) (that would like to visit me here in 
Washington DC) 


9.5 Coordination 

Noun phrases and other units can be conjoined with conjunctions like and , 
or , and but. For example a coordinate noun phrase can consist of two other 
noun phrases separated by a conjunction (we used brackets to mark the con¬ 
stituents): 

Please repeat [yy p [np the flights] and [yy p the costs]] 

I need to know [yyp [np the aircraft] and [np flight number]] 

I would like to fly from Denver stopping in [ NP [ NP Pittsburgh] 
and [np Atlanta]] 

Here’s a new rule for this: 

NP NP and NP (9.14) 

In addition to NPs, most other kinds of phrases can be conjoined (for 
example including sentences, VPs, and PPs): 

What flights do you have [y P [y P leaving Denver] and [y P arriv¬ 
ing in San Francisco]] 

[s Lv I’m interested in a flight from Dallas to Washington] and N 
I’m also interested in going to Baltimore]] 

Similar conjunction rules can be built for VP and S conjunction: 


CONJUNC¬ 

TIONS 

COORDINATE 


VP VPandVP 
S SandS 


(9.15) 

(9.16) 
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9.6 Agreement 

In Chapter 3 we discussed English inflectional morphology. Recall that most 
verbs in English can appear in two forms in the present tense: the form used 
for third-person, singular subjects (the flight does\ and the form used for all 
other kinds of subjects (all the flights do , I do). The third-person-singular 
(3sg form usually has a final -v where the non-3sg form does not. Here are 
some examples, again using the verb do , with various subjects: 

You [vp I v 7 said \s there were two flights that were the cheapest ]]] 

Do [np any flights] stop in Chicago? 

Do byp all of these flights] offer first class service? 

Do Ovp I] get dinner on this flight? 

Do bvp you] have a flight from Boston to Forth Worth? 

Does |jvp this flight] stop in Dallas? 

Does | y\/p that flight] serve dinner? 

Does [a rp Delta] fly from Atlanta to San Francisco? 

Here are more examples with the verb leave : 

What flights leave in the morning? 

What flight leaves from Pittsburgh? 

This agreement phenomenon occurs whenever there is a verb that has 
some noun acting as its subject. Note that sentences in which the subject 
does not agree with the verb are ungrammatical: 

* | What flight] leave in the morning? 

*Does bvp you] have a flight from Boston to Forth Worth? 

*Do [a ip this flight] stop in Dallas? 

How can we modify our grammar to handle these agreement phenom¬ 
ena? One way is to expand our grammar with multiple sets of rules, one rule 
set for 3sg subjects, and one for non-3sg subjects. For example, the rule that 
handled these yes-no-questions used to look like this: 

S Aux NP VP 

We could replace this with two rules of the following form: 

S 3sgAux 3sgNP VP 

S Non3sgAux Non3sgNP VP 
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We could then add rules for the lexicon like these: 

3sgAux —> does \ has \ can \ ... 

Non3sgAux —> do \ have \ can \ ... 

But we would also need to add rules for 3sgNP and Non3sgNP , again 
by making two copies of each rule for NP. While pronouns can be first, 
second, or third person, full lexical noun phrases can only be third person, 
so for them we just need to distinguish between singular and plural: 

3SgNP (Det) (Card) (Ord) (Quant) (AP) SgNominal 

Non3SgNP —> (Det) (Card) (Ord) (Quant) (AP) PINominal 

SgNominal SgNoun \ SgNoun SgNoun 

PINominal —> PINoun \ SgNoun PINoun 

SgNoun —> flight \ fare \ dollar \ reservation \ ... 

PINoun —> flights \ fares \ dollars \ reservations \ ... 

Dealing with the first and second person pronouns is left as an exercise for 
the reader. 

A problem with this method of dealing with number agreement is that 
it doubles the size of the grammar. Every rule that refers to a noun or a verb 
needs to have a ‘singular’ version and a ‘plural’ version. This rule prolif¬ 
eration will also have to happen for the noun’s case; for example English 
pronouns have nominative (I, she, he, they) and accusative (me, her, him, 
them) versions. We will need new versions of every NP and N rule for each 
of these. 

A more significant problem occurs in languages like German or French, 
which not only have noun-verb agreement like English, but also have gender 
agreement; the gender of a noun must agree with the gender of its modify¬ 
ing adjective and determiner. This adds another multiplier to the rule sets of 
the language. 

Chapter 11 will introduce a way to deal with these agreement problems 
without exploding the size of the grammar, by effectively parameterizing 
each nonterminal of the grammar with feature structures. 

9.7 The Verb Phrase and Subcategorization 

The verb phrase consists of the verb and a number of other constituents. In 
the simple rules we have built so far, these other constituents include NP' s 


CASE 

NOMINATIVE 

ACCUSATIVE 


GENDER 

AGREEMENT 



338 


Chapter 9. Context-Free Grammars for English 


and PP’ s and combinations of the two: 

VP Verb disappear 

VP —> Verb NP prefer a morning flight 

VP —> Verb NP PP leave Boston in the morning 

VP —y Verb PP leaving on Thursday 

Verb phrases can be significantly more complicated than this. Many 
other kinds of constituents can follow the verb, such as an entire embedded 
comp e l n e tial sentence. These are called sentential complements: 

You [yp [y said \s there were two flights that were the cheapest ]]] 

You [yp [y said Lv you had a two hundred sixty six dollar fare]] 

Vvp I v Tell] \np me] \s how to get from the airport in Philadelphia to 
downtown]] 

I Vvp lv think \s I would like to take the nine thirty flight]] 

Here’s a rule for these: 

VP Verb S 

Another potential constituent of the VP is another VP. This is often the 
case for verbs like want, would like , try , intend , need: 

I want [yp to fly from Milwaukee to Orlando] 

Hi, I want [yp to arrange three flights] 

Hello, I’m trying [yp to find a flight that goes from Pittsburgh to Den¬ 
ver after two PM 

Recall from Chapter 8 that verbs can also be followed by particles , 
words that resemble a preposition but that combine with the verb to form a 
phrasal verb like take off). These particles are generally considered to be 
an integral part of the verb in a way that other post-verbal elements are not; 
phrasal verbs are treated as individual verbs composed of two words. 

While a verb phrase can have many possible kinds of constituents, not 
every verb is compatible with every verb phrase. For example, the verb want 
can either be used with an NP complement (/ want a flight. ..), or with an 
infinitive VP complement (/ want to fly to. ..). By contrast, a verb like find 
cannot take this sort of VP complement. (* I found to fly to Dallas). 
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This idea that verbs are compatible with different kinds of comple¬ 
ments is a very old one; traditional grammar distinguishes between tran¬ 
sitive verbs like find, which take a direct object NP (I found a flight), and 
intransitive verbs like disappear, which do not (*/ disappeared a flight). 

Where traditional grammars subcategorize verbs into these two cate¬ 
gories (transitive and intransitive), modem grammars distinguish as many as 
100 subcategories. (In fact tagsets for many such subcategorization frames 
exists; see (Macleod et al, 1998) for the COMLEX tagset, Sanfilippo (1993) 
for the ACQUILEX tagset, and further discussion in Chapter 11). We say 
that a verb like find subcategorizes for an NP, while a verb like want sub¬ 
categorizes for either an NP or an infinite VP. We also call these constituents 
the complements of the verb (hence our use of the term sentential comple¬ 
ment above). So we say that want can take a VP complement. These pos¬ 
sible sets of complements are called the subcategorization frame for the 
verb. Another way of talking about the relation between the verb and these 
other constituents is to think of the verb as a predicate and the constituents 
as arguments of the predicate. So we can think of such predicate-argument 
relations as find (I, A flight), or want (I, to fly). We will talk more 
about this view of verbs and arguments in Chapter 14 when we talk about 
predicate calculus representations of verb semantics. 

Here are some subcategorization frames and example verbs: 


TRANSITIVE 

INTRANSITIVE 

SUBCATEGO¬ 

RIZE 


SUBCATEGO¬ 

RIZATION 

FRAMES 


SUBCATEGO¬ 

RIZES 

FOR 


COMPLE¬ 

MENTS 


SUBCATEGO¬ 

RIZATION 

FRAME 


Frame 

Verb 

Example 

0 

eat, sleep 

I want to eat 

NP 

prefer, find, leave, 

Find the flight from Pittsburgh to Boston 

NPNP 

show, give 

Show me airlines with flights from Pittsburgh 

^^from ^io 

fly, travel 

I would like to fly, from Boston to Philadelphia 

NPPP wit h 

help, load, 

Can you help [jyp me] [jyp with a flight] 

VPto 

prefer, want, need 

I would prefer [yp to to go by United airlines] 

VPbrst 

can, would, might 

I can [ vpbrst £° f rom Boston] 

S 

mean 

Does this mean [ s AA has a hub in Boston]? 


Note that a verb can subcategorize for a particular type of verb phrase, 
such as a verb phrase whose verb is an infinitive ( VPto), or a verb phrase 
whose verb is a bare stem (uninflected: VPbrst). Note also that a single verb 
can take different subcategorization frames. The verbid, for example, can 
take an NP NP frame (find me a flight) as well as an NP frame. 

How can we represent the relation between verbs and their comple¬ 
ments in a context-free grammar? One thing we could do is to do what we 
did with agreement features: make separate subtypes of the class Verb (Verb- 
with-NP-complement Verb-with-Inf-VP-complement Verb-with-S-complement 
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Verb-with-NP-plus-PP-complement , and so on): 

Verb-with-NP-complement —> find \ leave \ repeat | ... 

Verb-with-S-complement —> think \ believe \ say \ ... 
Verb-with-Inf-VP-complement —> want \ try \ need \ ... 

Then each of our VP rules could be modified to require the appropriate verb 
subtype: 

VP Verb-with-no-complement disappear 

VP —> Verb-with-NP-comp NP prefer a morning flight 

VP —> Verb-with-S-comp S said there were two flights 

The problem with this approach, as with the same solution to the agree¬ 
ment feature problem, is a vast explosion in the number of rules. The stan¬ 
dard solution to both of these problems is the feature structure, which will 
be introduced in Chapter 11. Chapter 11 will also discuss the fact that nouns, 
adjectives, and prepositions can subcategorize for complements just as verbs 
can. 


9.8 Auxiliaries 

auxiliaries The subclass of verbs called auxiliaries or helping verbs have particular 
syntactic constraints which can be viewed as a kind of subcategorization. 
modal Auxiliaries include the modal verbs can, could, may, might, must, will, 
perfect would, shall, and should , the perfect auxiliary have , the progressive auxil- 

procres- iary an( j thg passive auxiliary be. Each of these verbs places a constraint 

passive on the form of the following verb, and each of these must also combine in a 
particular order. 

Modal verbs subcategorize for a YP whose head verb is a bare stem, 
e.g. can go in the morning , will try to find a flight. The perfect verb have 
subcategorizes for a YP whose head verb is the past participle form: have 
booked 3 flights. The progressive verb be subcategorizes for a VP whose 
head verb is the gerundive participle: am going from Atlanta. The passive 
verb be subcategorizes for a YP whose head verb is the past participle: was 
delayed by inclement weather. 
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A sentence can have multiple auxiliary verbs, but they must occur in a 
particular order: 

modal < perfect < progressive < passive 

Here are some examples of multiple auxiliaries: 

modal perfect could have been a contender 

modal passive will be married 

perfect progressive have been feasting 

modal perfect passive might have been prevented 

Auxiliaries are often treated just like verbs such as want , seem , or in¬ 
tend , which subcategorize for particular kinds of YP complements. Thus 
can would be listed in the lexicon as a verb-with-bare-stem-VP-complement. 

One way of capturing the ordering constraints among auxiliaries, commonly 
used in the systemic grammar of Halliday (1985a), is to introduce a special grammar 
constituent called the verb group, whose subconstituents include all the aux- verb group 
diaries as well as the main verb. Some of the ordering constraints can also 
be captured in a different way. Since modals, for example, do not having 
a progressive or participle form, they simply will never be allow to follow 
progressive or passive be or perfect have. Exercise 9.8 asks the reader to 
write grammar rules for auxiliaries. 

The passive construction has a number of properties that make it differ¬ 
ent than other auxiliaries. One important difference is a semantic one; while 
the subject of non-passive (active) sentence is often the semantic agent of active 
the event described by the verb (/ prevented a catastrophe ) the subject of 
the passive is often the undergoer or patient of the event (a catastrophe was 
prevented). This will be discussed further in Chapter 15. 

9.9 Spoken Language Syntax 

The grammar of written English and the grammar of conversational spoken 
English share many features, but also differ in a number of respects. This 
section gives a quick sketch of a number of the characteristics of the syntax 
of spoken English. 

We usually use the term utterance rather than sentence for the units utterance 
of spoken language. Figure 9.5 shows some sample spoken ATIS utterances 
that exhibit many aspects of spoken language grammar. 

This is a standard style of transcription used in transcribing speech 
corpora for speech recognition. The comma V marks a short pause, each 
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the . [exhale] . . . [inhale] . . [uh] does American airlines . offer any 
. one way flights . [uh] one way fares, for one hundred and sixty one 
dollars 

[mm] i’d like to leave i guess between [um] . [smack] . five o’clock no, 
five o’clock and [uh], seven o’clock . P M 
around, four, P M 

all right, [throat_clear] . . i’d like to know the . give me the flight. times 
. in the morning . for September twentieth . nineteen ninety one 
[uh] one way 
[uh] seven fifteen, please 

on United airlines . . give me, the . . time . . from New York . [smack] 

. to Boise-, to . I’m sorry . on United airlines . [uh] give me the flight, 
numbers, the flight times from . [uh] Boston . to Dallas _ 

Figure 9.5 Some sample spoken utterances from users interacting with the 
ATIS system. 


PROSODY 

PITCH 

CONTOUR 

STRESS 

PATTERN 


period V marks a long pause, and the square brackets ‘[uh]’ mark non¬ 
verbal events (breaths, lipsmacks, uhs and urns). 

There are a number of ways these utterances differ from written En¬ 
glish sentences. One is in the lexical statistics; for example spoken English 
is much higher in pronouns than written English; the subject of a spoken 
sentence is almost invariably a pronoun. Another is in the presence of var¬ 
ious kinds of disfluencies (hesitations, repairs, restarts, etc) to be discussed 
below. Spoken sentences often consist of short fragments or phrases {one 
way or around four PM , which are less common in written English. 

Finally, these sentences were spoken with a particular prosody. The 
prosody of an utterance includes its particular pitch contour (the rise and 
fall of the fundamental frequency of the soundwave), its stress pattern or 
rhythm (the series of stressed and unstressed syllables that make up a sen¬ 
tence) and other similar factors like the rate (speed) of speech. 


Disfluencies 

Perhaps the most salient syntactic feature that distinguishes spoken and writ¬ 
ten language is the class of phenomena known as disfluencies. Disfluencies 
include the use of uh and urn, word repetitions, and false starts. The ATIS 
sentence in Figure 9.6 shows examples of a false start and the use of uh . The 
false start here occurs when the speaker starts by asking for one-way flights. 
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and then stops and corrects herself, beginning again and asking about one¬ 
way fares. 


Interruption Point 

■ 


Does American airlines offer any one-way flights * [uh] 

one-way fares for 160 dollars? 



Reparandum Interregnum 

Repair 

Figure 9.6 An example of a disfluency (after Shriberg (1994)). 


The segment one-way flights is referred to as the reparandum, and the reparandum 
replacing sequence one-way fares is referred to as the repair (these terms are repair 

INTERRUP- 

from Levelt (1983)). The interruption point, where the speaker breaks off tion t 
the original word sequence, here occurs right after the word flights'. 

The words uh and urn (sometimes called filled pauses) can be treated in 
the lexicon like regular words, and indeed this is often how they are modeled 
in speech recognition. The HMM pronunciation lexicons in speech recog¬ 
nizers often include pronunciation models of these words, and the A-gram 
grammar used by recognizers include the probabilities of these occurring 
with other words. 

For speech understanding, where our goal is to build a meaning for the 
input sentence, it may be useful to detect these restarts in order to edit out 
what the speaker probably considered the ‘corrected’ words. For example in 
the sentence above, if we could detect that there was a restart, we could just 
delete the reparandum, and parse the remaining parts of the sentence: 

Does American airlines offer any one-way flights uh one-way fares 
for 160 dollars? 

How do disfluencies interact with the constituent structure of the sen¬ 
tence? Hindle (1983) showed that the repair often has the same structure 
as the constituent just before the interruption point. Thus in the example 
above, the repair is a PP, as is the reparandum. This means that if it is pos¬ 
sible to automatically find the interruption point, it is also often possible to 
automatically detect the boundaries of the reparandum. 

9.10 Grammar Equivalence & Normal Form 

A formal language is defined as a (possibly infinite) set of strings of words. 

This suggests that we could ask if two grammars are equivalent by asking if 
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they generate the same set of strings. In fact it is possible to have two distinct 
context-free grammars generate the same language. 

We usually distinguish two kinds of grammar equivalence: weak equiv¬ 
alence and strong equivalence. Two grammars are strongly equivalent if 
they generate the same set of string and if they assign the same phrase struc¬ 
ture to each sentence (allowing merely for renaming of the non-terminal 
symbols). Two grammars are weakly equivalent if they generate the same 
set of strings but do not assign the same phrase structure to each sentence. 
form al It is sometimes useful to have a normal form for grammars, in which 

each of the productions takes a particular form. For example a context-free 
normal grammar is in Chomsky normal form (CNF) (Chomsky, 1963) if it is £-free 
and if in addition each production is either of the form A —» B C or A —> a. 
That is, the righthand side of each rule either has two non-terminal symbols 
or one terminal symbol. Chomsky normal form grammars have binary trees 
(down to the prelexical nodes), which can be useful for certain algorithms. 

Any grammar can be converted into a weakly-equivalent Chomsky nor¬ 
mal form grammar. For example a rule of the form 

A ^ B C D 

can be converted into the following two CNF rules: 

A -> B X 

X -> C D 

Exercise 9.11 asks the reader to formulate the complete algorithm. 

9.11 Finite State & Context-Free Grammars 

We argued in Section 9.1 that a complex model of grammar would have to 
represent constituency. This is one reason that finite-state models of gram¬ 
mar are often inadequate. Now that we have explored some of the details of 
the syntax of noun phrases, we are prepared to discuss another problem with 
recursion finite-state grammars. This problem is recursion. Recursion in a grammar 
occurs when an expansion of a non-terminal includes the non-terminal itself, 
as we saw in rules like Nominal Nominal PP in the previous section. 

In order to see why this is a problem for finite-state grammars, let’s 
first attempt to build a finite-state model for some of the grammar rules we 
have seen so far. For example, we could model the noun phrase up to the 
head with a regular expression (= FSA) as follows: 
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(Det) (Card) (Ord) (Quant) (AP) Nominal 

What about the postmodifiers? Let’s just try adding the PP. We could 
then augment the regular expression as follows: 

(Det) (Card) (Ord) (Quant) (AP) Nominal (PP)* 

So to complete this regular expression we just need to expand inline 
the definition of PP , as follows: 

(Det) (Card) (Ord) (Quant) (AP) Nominal (P NP)* 

But wait; our definition of NP now presupposes an NP\ We would need 
to expand the rule as follows: 

(Det) (Card) (Ord) (Quant) (AP) Nominal (P (Det) (Card) (Ord) 

(Quant) (AP) Nominal (P NP))* 

But of course the NP is back again! The problem is that NP is a re¬ 
cursive rule. There is actually a sneaky way to ‘unwind’ this particular 
right-recursive rule in a finite-state automaton. In general, however, recur¬ 
sion cannot be handled in finite automata, and recursion is quite common 
in a complete model of the NP (for example for RelClause and GerundVP , 
which also have NP in their expansion): 

(Det) (Card) (Ord) (Quant) (AP)Nominal (RelClause\GerundVP\PP)* 

In particular, Chomsky (1959a) proved that a context-free language L 
can be generated by a finite automaton if and only if there is a context-free 
grammar that generates L that does not have any center-embedded recur¬ 
sions (recursions of the form A —> a A (3). 

While it thus seems at least likely that we can’t model all of English 
syntax with a finite state grammar, it is possible to build an FSA that approx¬ 
imates English (for example by expanding only a certain number of NPs). In 
fact there are algorithms for automatically generating finite-state grammars 
that approximate context-free grammars (Pereira and Wright, 1997). 

Chapter 10 will discuss an augmented version of the finite-state au¬ 
tomata called the recursive transition network or RTN that adds the com¬ 
plete power of recursion to the FSA. The resulting machine is exactly iso¬ 
morphic to the context-free grammar, and can be a useful metaphor for 
studying CFGs in certain circumstances. 


RECURSIVE 

RULE 
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9.12 Grammars & Human Processing 

Do people use context-free grammars in their mental processing of lan¬ 
guage? It has proved very difficult to find clear-cut evidence that they do. 
For example, some early experiments asked subjects to judge which words 
in a sentence were more closely connected (Levelt, 1970), finding that their 
intuitive group corresponded to syntactic constituents. Other experimenters 
examined the role of constituents in auditory comprehension by having sub¬ 
jects listen to sentences while also listening to short “clicks” at different 
times. Fodor and Bever (1965) found that subjects often mis-heard the 
clicks as if they occurred at constituent boundaries. They argued that the 
constituent was thus a ‘perceptual unit’ which resisted interruption. Unfor¬ 
tunately there were severe methodological problems with the click paradigm 
(see for example Clark and Clark (1977) for a discussion). 

A broader problem with all these early studies is that they do not con¬ 
trol for the fact that constituents are often semantic units as well as syntactic 
units. Thus, as will be discussed further in Chapter 15, a single odd block is a 
constituent (an NP) but also a semantic unit (an object of type block which 
has certain properties). Thus experiments which show that people notice the 
boundaries of constituents could simply be measuring a semantic rather than 
a syntactic fact. 

Thus it is necessary to find evidence for a constituent which is not 
a semantic unit. Furthermore, since there are many non-constituent-based 
theories of grammar based on lexical dependencies, it is important to find 
evidence that cannot be interpreted as a lexical fact; i.e. evidence for con¬ 
stituency that is not based on particular words. 

One suggestive series of experiments arguing for constituency has come 
from Kathryn Bock and her colleagues. Bock and Loebell (1990), for exam¬ 
ple, avoided all these earlier pitfalls by studying whether a subject who uses 
a particular syntactic constituent (for example a verb-phrase of a particular 
type, like V NP PP\ is more likely to use the constituent in following sen¬ 
tences. In other words, they asked whether use of a constituent primes its 
use in subsequent sentences. As we saw in previous chapters, priming is a 
common way to test for the existence of a mental structure. Bock and Loe¬ 
bell relied on the English ditransitive alternation. A ditransitive verb is one 
like give which can take two arguments: 

(9.17) The wealthy widow gave |jv p the church] |yv/> her Mercedes]. 

The verb give allows another possible subcategorization frame, called 
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a prepositional dative in which the indirect object is expressed as a prepo¬ 
sitional phrase: 

(9.18) The wealthy widow gave [yy p her Mercedes] \ PP to the church]. 

As we discussed on page 339, many verbs other than give have such 
alternations (send, sell , etc; see Levin (1993) for a summary of many dif- tions na " 
ferent alternation patterns). Bock and Loebell relied on these alternations by 
giving subjects a picture, and asking them to describe it in one sentence. The 
picture was designed to elicit verbs like give or sell by showing an event such 
as a boy handing an apple to a teacher. Since these verbs alternate, subjects 
might, for example, say The boy gave the apple to the teacher or The boy 
gave the teacher an apple. 

Before describing the picture, subjects were asked to read an unrelated 
‘priming’ sentence out loud; the priming sentences either had V NP NP or 
V NP PP structure. Crucially, while these priming sentences had the same 
constituent structure as the dative alternation sentences, they did not have the 
same semantics. For example, the priming sentences might be prepositional 
locatives , rather than datives : 

(9.19) IBM moved [j^ P a bigger computer] [ PP to the Sears store]. 

Bock and Loebell found that subjects who had just read a V NP PP 
sentence were more like to use a VNP PP structure in describing the picture. 

This suggested that the use of a particular constituent primed the later use of 
that constituent, and hence that the constituent must be mentally represented 
in order to prime and be primed. 

In more recent work, Bock and her colleagues have continued to find 
evidence for this kind of constituency structure. 

There is a quite different disagreement about the human use of context- 
free grammars. Many researchers have suggested that natural language is 
unlike a formal language, and in particular that the set of possible sentences 
in a language cannot be described by purely syntactic context-free grammar 
productions. They argue that a complete model of syntactic structure will 
prove to be impossible unless it includes knowledge from other domains 
(for example like semantic, intonational, pragmatic, and social/interactional 
domains). Others argue that the syntax of natural language can be repre¬ 
sented by formal languages. This second position is called modularist: re- modularist 
searchers holding this position argue that human syntactic knowledge is a 
distinct module of the human mind. The first position, in which grammatical 
knowledge may incorporate semantic, pragmatic, and other constraints, is 
called anti-modularist. We will return to this debate in Chapter 15. 


ANTI- 

MODULARIST 
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9.13 Summary 


This chapter has introduced a number of fundamental concepts in syntax via 

the context-free grammar. 

• In many languages, groups of consecutive words act as a group or a 
constituent, which can be modeled by context-free grammars (also 
known as phrase-structure grammars. 

• A context-free grammar consists of a set of rules or productions, ex¬ 
pressed over a set of non-terminal symbols and a set of terminal sym¬ 
bols. Formally, a particular context-free language is the set of strings 
which can be derived from a particular context-free grammar. 

• A generative grammar is a traditional name in linguistics for a formal 
language which is used to model the grammar of a natural language. 

• There are many sentence-level grammatical constructions in English; 
declarative, imperative, yes-no-question, and wh-question are four 
very common types, which can be modeled with context-free rules. 

• An English noun phrase can have determiners, numbers, quanti¬ 
fiers, and adjective phrases preceding the head noun, which can be 
followed by a number of postmodifiers; gerundive VPs, infinitives 
VPs, and past participial VPs are common possibilities. 

• Subjects in English agree with the main verb in person and number. 

• Verbs can be subcategorized by the types of complements they ex¬ 
pect. Simple subcategories are transitive and intransitive; most gram¬ 
mars include many more categories than these. 

• The correlate of sentences in spoken language are generally called ut¬ 
terances. Utterances may be disfluent, containing filled pauses like 
um and ulu restarts, and repairs. 

• Any context-free grammar can be converted to Chomsky normal form, 
in which the right-hand-side of each rule has either two non-terminals 
or a single terminal. 

• Context-free grammars are more powerful than finite-state automata, 
but it is nonetheless possible to approximate a context-free grammar 
with a FSA. 

• There is some evidence that constituency plays a role in the human 
processing of language. 
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Bibliographical and Historical Notes 

“den sprachlichen Ausdruck fur die willkiirliche Gliederung einer Ge- 
sammtvorstellung in ihre in logische Beziehung zueinander gesetzten 
Bestandteile” 

“the linguistic expression for the arbitrary division of a total idea into 
its constituent parts placed in logical relations to one another” 

Wundt’s (1900:240) definition of the sentence; the origin of 
the idea of phrasal constituency, cited in Percival (1976). 


The recent historical research of Percival (1976) has made it clear 
that this idea of breaking up a sentence into a hierarchy of constituents ap¬ 
peared in the Volkerpsychologie of the groundbreaking psychologist Wil¬ 
helm Wundt (Wundt, 1900). By contrast, traditional European grammar, 
dating from the Classical period, defined relations between words rather than 
constituents. Wundt’s idea of constituency was taken up into linguistics by 
Leonard Bloomfield in his early book An Introduction to the Study of Lan¬ 
guage (Bloomfield, 1914). By the time of his later book Language (Bloom¬ 
field, 1933), what was then called ‘immediate-constituent analysis’ was a 
well-established method of syntactic study in the United States. By contrast, 
European syntacticians retained an emphasis on word-based or dependency 
grammars; Chapter 12 discusses some of these issues in introducing depen¬ 
dency grammar. 

American Structuralism saw a number of specific definitions of the 
immediate constituent, couched in terms of their search for a ‘discovery pro¬ 
cedure’; a methodological algorithm for describing the syntax of a language. 
In general, these attempt to capture the intuition that “The primary criterion 
of the immediate constituent is the degree in which combinations behave as 
simple units” (Bazell, 1952, p. 284). The most well-known of the specific 
definitions is Harris’ idea of distributional similarity to individual units, with 
the substitutability test. Essentially, the method proceeded by breaking up 
a construction into constituents by attempting to substitute simple structures 
for possible constituents — if a substitution of a simple form, say man , was 
substitutable in a construction for a more complex set (like intense young 
man), then the form intense young man was probably a constituent. Har¬ 
ris’s test was the beginning of the intuition that a constituent is a kind of 
equivalence class. 
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The first formalization of this idea of hierarchical constituency was 
the phrase-structure grammar defined in Chomsky (1956), and further ex¬ 
panded upon (and argued against) in Chomsky (1957) and Chomsky (1975). 
From this time on, most generative linguistic theories were based at least 
in part on context-free grammars (such as Head-Driven Phrase Structure 
Grammar (Pollard and Sag, 1994), Lexical-Functional Grammar (Bresnan, 
1982), Government and Binding (Chomsky, 1981), and Construction Gram¬ 
mar (Kay and Fillmore, 1999), inter alia); many of these theories used 
schematic context-free templates known as X-bar schemata. 

Shortly after Chomsky’s initial work, the context-free grammar was re¬ 
discovered by Backus (1959) and independently by Naur etal (1960) in their 
descriptions of the ALGOL programming language; Backus (1996) noted 
that he was influenced by the productions of Emil Post and that Naur’s work 
was independent of his (Backus’) own. After this early work, a great num¬ 
ber of computational models of natural language processing were based on 
context-free grammars because of the early development of efficient algo¬ 
rithms to parse these grammars (see Chapter 10). 

As we have already noted, grammars based on context-free rules are 
not ubiquitous. One extended formalism is Tree Adjoining Grammar (TAG) 
(Joshi, 1985). The primary data structure in Tree Adjoining Grammar is the 
tree, rather than the rule. Trees come in two kinds; initial trees and auxiliary 
trees. Initial trees might, for example, represent simple sentential structures, 
while auxiliary trees are used to add recursion into a tree. Trees are combined 
by two operations called substitution and adjunction. See Joshi (1985) for 
more details. An extension of Tree Adjoining Grammar called Lexicalized 
Tree Adjoining Grammars will be discussed in Chapter 12. 

Another class of grammatical theories that are not based on context- 
free grammars are instead based on the relation between words rather than 
constituents. Various such theories have come to be known as dependency 
grammars; representative examples include the dependency grammar of 
Mel’cuk (1979), the Word Grammar of Hudson (1984), or the Constraint 
Grammar of Karlsson et al (1995). Dependency-based grammars have re¬ 
turned to popularity in modem statistical parsers, as the field have come to 
understand the crucial role of word-to-word relations; see Chapter 12 for 
further discussion. 

Readers interested in general references grammars of English should 
waste no time in getting hold of Quirk et al. (1985a). Other useful treatments 
include McCawley (1998). 
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There are many good introductory textbook on syntax. Sag and Wasow 
(1999) is an introduction to formal syntax, focusing on the use of phrase- syntax 1 
structure, unification, and the type-hierarchy in Head-Driven Phrase Struc¬ 
ture Grammar, van Valin (1999) is an introduction from a less formal, more 
functional perspective, focusing on cross-linguistic data and on the func¬ 
tional motivation for syntactic structures. 


Exercises 

9.1 Draw tree structures for the following ATIS phrases: 

a. Dallas 

b. from Denver 

c. after five p.m. 

d. arriving in Washington 

e. early flights 

f. all redeye flights 

g. on Thursday 

h. a one-way fare 

i. any delays in Denver 

9.2 Draw tree structures for the following ATIS sentences: 

a. Does American airlines have a flight between five a.m. and six a.m. 

b. I would like to fly on American airlines. 

c. Please repeat that. 

d. Does American 487 have a first class section? 

e. I need to fly between Philadelphia and Atlanta. 

f. What is the fare from Atlanta to Denver? 

g. Is there an American airlines flight from Philadelphia to Dallas? 

9.3 Augment the grammar rules on page 337 to handle pronouns. Deal 
properly with person and case. 
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POSSESSIVE 

GENITIVE 


9.4 Modify the noun phrase grammar of Sections 9.4-9.6 to correctly model 
mass nouns and their agreement properties 

9.5 How many types of NPs would rule (9.10) on page 332 expand to if we 
didn’t allow parentheses in our grammar formalism? 

9.6 Assume a grammar that has many YPs rules for different subcatego¬ 
rization, as expressed in Section 9.7, and differently subcategorized verb 
rules like Verb-with-NP-complement. How would the rule for post-nominal 
relative clauses (9.12) need to be modified if we wanted to deal properly with 
examples like the earliest flight that you have ? Recall that in such examples 
the pronoun that is the object of the verb get. Your rules should allow this 
noun phrase but should correctly rule out the ungrammatical S */ get. 

9.7 Does your solution to the previous problem correctly model the NP the 
earliest flight that I can gefl How about the earliest flight that I think my 
mother wants me to book for her ? Hint: this phenomenon is called long¬ 
distance dependency. 

9.8 Write rules expressing the verbal subcategory of English auxiliaries; 
for example you might have a rule can verb-with-bare-stem-VP-complement. 

9.9 NPs like Fortune's office or my uncle's marks are called possessive or 
genitive noun phrases. A possessive noun phrase can be modeled by treated 
the sub-NP like Fortune's or my uncle's as a determiner of the following head 
noun. Write grammar rules for English possessives. You may treat 's as if it 
were a separate word (i.e. as if there were always a space before 's). 

9.10 Page 330 discussed the need for a Wh-NP constituent. The simplest 
Wh-NP is one of the wh-pronouns (who, whom, whose, which). The Wh- 
words, what and which can be determiners: which four will you have?, what 
credit do you have with the Duke?. Write rules for the different types of 
Wh-NPs. 

9.11 Write an algorithm for converting an arbitrary context-free grammar 
into Chomsky normal form. 




PARSING WITH 

CONTEXT-FREE 

GRAMMARS 


There are and can exist but two ways of investigating and dis¬ 
covering truth. The one hurries on rapidly from the senses and 
particulars to the most general axioms , and from them... derives 
and discovers the intermediate axioms. The other constructs its 
axioms from the senses and particulars y by ascending continu¬ 
ally and gradually } till it finally arrives at the most general ax¬ 
ioms. 

Francis Bacon, Novum Organum Book 1.19 (1620) 


By the 17th century, the western philosophical tradition had begun 
to distinguish two important insights about human use and acquisition of 
knowledge. The empiricist tradition, championed especially in Britain, by empiricist 
Bacon and Locke, focused on the way that knowledge is induced and rea¬ 
soning proceeds based on data and experience from the external world. The 
rationalist tradition, championed especially on the Continent by Descartes rationalist 
but following a tradition dating back to Plato’s Meno, focused on the way 
that learning and reasoning is guided by prior knowledge and innate ideas. 

This dialectic continues today, and has played a important role in char¬ 
acterizing algorithms for parsing. We defined parsing in Chapter 3 as a 
combination of recognizing an input string and assigning some structure to 
it. Syntactic parsing, then, is the task of recognizing a sentence and assigning 
a syntactic structure to it. This chapter focuses on the kind of structures as¬ 
signed by the context-free grammars of Chapter 9. Since context-free gram¬ 
mars are a declarative formalism, they don’t specify how the parse tree for 
a given sentence should be computed. This chapter will, therefore, present 
some of the many possible algorithms for automatically assigning a context- 
free (phrase structure) tree to an input sentence. 
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Parse trees are directly useful in applications such as grammar check¬ 
ing in word-processing systems; a sentence which cannot be parsed may 
have grammatical errors (or at least be hard to read). In addition, parsing is 
an important intermediate stage of representation for semantic analysis (as 
we will see in Chapter 15), and thus plays an important role in applications 
like machine translation, question answering, and information extrac¬ 
tion. For example, in order to answer the question 


What books were written by British women authors before 1800? 


we’ll want to know that the subject of the sentence was what books and that 
the by-adjunct was British women authors to help us figure out that the user 
wants a list of books (and not just a list of authors). Syntactic parsers are also 
used in lexicography applications for building on-line versions of dictionar¬ 
ies. Finally, stochastic versions of parsing algorithms have recently begun to 
be incorporated into speech recognizers, both for language models (Ney, 
1991) and for non-finite-state acoustic and phonotactic modeling (Lari and 
Young, 1991). 

The main parsing algorithm presented in this chapter is the Earley al¬ 
gorithm (Earley, 1970), one of the context-free parsing algorithms based on 
dynamic programming. We have already seen a number of dynamic pro¬ 
gramming algorithms - Minimum-Edit-Distance, Yiterbi, Forward. The Ear- 
ley algorithm is one of three commonly-used dynamic programming parsers; 
the others are the Cocke-Younger-Kasami (CYK) algorithm which we will 
present in Chapter 12, and the Graham-Harrison-Ruzzo (GHR) (Graham 
et al , 1980) algorithm. Before presenting the Earley algorithm, we begin by 
motivating various basic parsing ideas which make up the algorithm. First, 
we revisit the ‘search metaphor’ for parsing and recognition, which we in¬ 
troduced for finite-state automata in Chapter 2, and talk about the top-down 
and bottom-up search strategies. We then introduce a ‘baseline’ top-down 
backtracking parsing algorithm, to introduce the idea of simple but efficient 
parsing. While this parser is perspicuous and relatively efficient, it is unable 
to deal efficiently with the important problem of ambiguity: a sentence or 
words which can have more than one parse. The final section of the chapter 
then shows how the Earley algorithm can use insights from the top-down 
parser with bottom-up filtering to efficiently handle ambiguous inputs. 
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10.1 Parsing as Search 

Chapters 2 and 3 showed that finding the right path through a finite-state 
automaton, or finding the right transduction for an input, can be viewed as 
a search problem. For FSAs, for example, the parser is searching through 
the space of all possible paths through the automaton. In syntactic parsing, 
the parser can be viewed as searching through the space of all possible parse 
trees to find the correct parse tree for the sentence. Just as the search space of 
possible paths was defined by the structure of the FSA, so the search space 
of possible parse trees is defined by the grammar. For example, consider the 
following ATIS sentence: 

(10.1) Book that flight. 

Using the miniature grammar and lexicon in Figure 10.2, which con¬ 
sists of some of the CFG rules for English introduced in Chapter 9, the cor¬ 
rect parse tree that would be would assigned to this example is shown in 
Figure 10.1. 



How can we use the grammar in Figure 10.2 to assign the parse tree in 
Figure 10.1 to Example (10.1)? (In this case there is only one parse tree, but 
it is possible for there to be more than one.) The goal of a parsing search is to 
find all trees whose root is the start symbol 5, which cover exactly the words 
in the input. Regardless of the search algorithm we choose, there are clearly 
two kinds of constraints that should help guide the search. One kind of 
constraint comes from the data, i.e. the input sentence itself. Whatever else 
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S ^ NP VP 

Det —> that this a 

S -4 AuxNPVP 

Noun —> book flight meal money 

S -4 VP 

Verb book include prefer 

NP -4 Det Nominal 

Aux —> does 

Nominal Noun 


Nominal -4 Noun Nominal 

Prep from \ to \ on 

NP -4 Proper-Noun 

Proper-Noun Houston TWA 

VP -4 Verb 


VP -4 VerbNP 

Nominal Nominal PP 

Figure 10.2 A miniature English grammar and lexicon. 


is true of the final parse tree, we know that there must be three leaves, and 
they must be the words book, that, and flight. The second kind of constraint 
comes from the grammar. We know that whatever else is true of the final 
parse tree, it must have one root, which must be the start symbol S. 

These two constraints, recalling the empiricist/rationalist debate de¬ 
scribed at the beginning of this chapter, give rise to the two search strategies 
underlying most parsers: top-down or goal-directed search and bottom-up 
or data-directed search. 

Top-Down Parsing 

top-down A top-down parser searches for a parse tree by trying to build from the root 
node S down to the leaves. Let’s consider the search space that a top-down 
parser explores, assuming for the moment that it builds all possible trees in 
parallel. The algorithm starts by assuming the input can be derived by the 
designated start symbol S. The next step is to find the tops of all trees which 
can start with S , by looking for all the grammar rules with S on the left-hand 
side. In the grammar in Figure 10.2, there are three rules that expand S, so 

ply the second ply, or level, of the search space in Figure 10.3 has three partial 

trees. 

We next expand the constituents in these three new trees, just as we 
originally expanded S. The first tree tells us to expect an NP followed by a 
VP, the second expects an Aux followed by an NP and a VP, and the third a 
VP by itself. To fit the search space on the page, we have shown in the third 
ply of Figure 10.3 only the trees resulting from the expansion of the left-most 
leaves of each tree. At each ply of the search space we use the right-hand- 
sides of the rules to provide new sets of expectations for the parser, which 
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NP 


YP 


AUX NP VP 


S 

YP 



YP NP 


YP Aux NP YP Aux NP YP YP YP 


Det 


Nom 


PropN 


Det 


Nom 


PropN Y NP Y 


Figure 10.3 An expanding top-down search space. Each ply is created by 
taking each tree from the previous ply, replacing the leftmost non-terminal 
with each of its possible expansions, and collecting each of these trees into a 
new ply. 


are then used to recursively generate the rest of the trees. Trees are grown 
downward until they eventually reach the part-of-speech categories at the 
bottom of the tree. At this point, trees whose leaves fail to match all the 
words in the input can be rejected, leaving behind those trees that represent 
successful parses. 

In Figure 10.3, only the 5th parse tree (the one which has expanded 
the rule VP -A Verb NP) will eventually match the input sentence Book that 
flight. The reader should check this for themselves in Figure 10.1. 

Bottom-Up Parsing 

Bottom-up parsing is the earliest known parsing algorithm (it was first sug- bottom-up 
gested by Yngve (1955)), and is used in the shift-reduce parsers common 
for computer languages (Aho and Ullman, 1972). In bottom-up parsing, the 
parser starts with the words of the input, and tries to build trees from the 
words up, again by applying rules from the grammar one at a time. The 
parse is successful if the parser succeeds in building a tree rooted in the start 
symbol S that covers all of the input. Figure 10.4 show the bottom-up search 
space, beginning with the sentence Book that flight. The parser begins by 
looking up each word 0 book , that , and flight) in the lexicon and building 
three partial trees with the part of speech for each word. But the word book 
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is ambiguous; it can be a noun or a verb. Thus the parser must consider 
two possible sets of trees. The first two plies in Figure 10.4 show this initial 
bifurcation of the search space. 



Each of the trees in the second ply are then expanded. In the parse 
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on the left (the one in which book is incorrectly considered a noun), the 
Nominal —> Noun rule is applied to both of the Nouns (book and flight). This 
same rule is also applied to the sole Noun (flight) on the right, producing the 
trees on the third ply. 

In general, the parser extends one ply to the next by looking for places 
in the parse-in-progress where the right-hand-side of some rule might fit. 
This contrasts with the earlier top-down parser, which expanded trees by ap¬ 
plying rules when their left-hand side matched an unexpanded nonterminal. 
Thus in the fourth ply, in the first and third parse, the sequence Det Nominal 
is recognized as the right-hand side of the NP —> Det Nominal rule. 

In the fifth ply, the interpretation of book as a noun has been pruned 
from the search space. This is because this parse cannot be continued: there 
is no rule in the grammar with the right-hand side Nominal NP. 

The final ply of the search space (not shown in Figure 10.4) is the 
correct parse tree (see Figure 10.1). Make sure you understand which of the 
two parses on the penultimate ply gave rise to this parse. 

Comparing Top-down and Bottom-up Parsing 

Each of these two architectures has its own advantages and disadvantages. 
The top-down strategy never wastes time exploring trees that cannot result 
in an S, since it begins by generating just those trees. This means it also 
never explores subtrees that cannot find a place in some S-rooted tree. In the 
bottom-up strategy, by contrast, trees that have no hope of leading to an 5, 
or fitting in with any of their neighbors, are generated with wild abandon. 
For example the left branch of the search space in Figure 10.4 is completely 
wasted effort; it is based on interpreting book as a Noun at the beginning of 
the sentence despite the fact no such tree can lead to an S given this grammar. 

The top-down approach has its own inefficiencies. While it does not 
waste time with trees that do not lead to an 5, it does spend considerable 
effort on S trees that are not consistent with the input. Note that the first 
four of the six trees in the third ply in Figure 10.3 all have left branches that 
cannot match the word book. None of these trees could possibly be used 
in parsing this sentence. This weakness in top-down parsers arises from the 
fact that they can generate trees before ever examining the input. Bottom-up 
parsers, on the other hand, never suggest trees that are not at least locally 
grounded in the actual input. 

Neither of these approaches adequately exploits the constraints pre¬ 
sented by the grammar and the input words. In the next section, we present 
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a baseline parsing algorithm that incorporates features of both the top-down 
and bottom-up approaches. This parser is not as efficient as the Earley or 
CYK parsers we will introduce later, but it is useful for showing the basic 
operations of parsing. 


10.2 A Basic Top-down Parser 

There are any number of ways of combining the best features of top-down 
and bottom-up parsing into a single algorithm. One fairly straightforward 
approach is to adopt one technique as the primary control strategy used to 
generate trees and then use constraints from the other technique to filter out 
inappropriate parses on the fly. The parser we develop in this section uses a 
top-down control strategy augmented with a bottom-up filtering mechanism. 
Our first step will be to develop a concrete implementation of the top-down 
strategy described in the last section. The ability to filter bad parses based on 
bottom-up constraints from the input will then be grafted onto this top-down 
parser. 

In our discussions of both top-down and bottom-up parsing, we as- 
parallel sumed that we would explore all possible parse trees in parallel. Thus each 
ply of the search in Figure 10.3 and Figure 10.4 showed all possible expan¬ 
sions of the parse trees on the previous plies. Although it is certainly possible 
to implement this method directly, it typically entails the use of an unrealistic 
amount of memory to store the space of trees as they are being constructed. 
This is especially true since realistic grammars have much more ambiguity 
than the miniature grammar in Figure 10.2. 

strategy st A more reasonable approach is to use a depth-first strategy such as 

the one used to implement the various finite state machines in Chapter 2 and 
Chapter 3. The depth-first approach expands the search space incrementally 
by systematically exploring one state at a time. The state chosen for expan¬ 
sion is the most recently generated one. When this strategy arrives at a tree 
that is inconsistent with the input, the search continues by returning to the 
most recently generated, as yet unexplored, tree. The net effect of this strat¬ 
egy is a parser that single-mindedly pursues trees until they either succeed or 
fail before returning to work on trees generated earlier in the process. Figure 
10.5 illustrates such a top-down, depth-first derivation using Grammar 10.2. 

Note that this derivation is not fully determined by the specification of a 
top-down, depth-first strategy. There are two kinds of choices that have been 
left unspecified that can lead to different derivations: the choice of which 
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leaf node of a tree to expand and the order in which applicable grammar 
rules are applied. In this derivation, the left-most unexpanded leaf node of 
the current tree is being expanded first, and the applicable rules of the gram¬ 
mar are being applied according to their textual order in the grammar. The 
decision to expand the left-most unexpanded node in the tree is important 
since it determines the order in which the input words will be consulted as 
the tree is constructed. Specifically, it results in a relatively natural forward 
incorporation of the input words into a tree. The second choice of applying 
rules in their textual order has consequences that will be discussed later. 

Figure 10.6 presents a parsing algorithm that instantiates this top-down, 
depth-first, left-to-right strategy. This algorithm maintains an agenda of agenda 
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function Top-Down-Pars Eiinput, grammar) returns a parse tree 

agenda (Initial S tree, Beginning of input) 
current-search-state P op (agenda) 

loop 

if Successful-Parse 1 (current-search-state) then 
return Tree (current-search-state) 

else 

if CAT(NODE-To-ExPAND(current-search-state)) is aPOS then 
if Cat (node-to-expand) 

C 

POS(current-input {current-search-state)) then 
Push(Apply-Lexical-Rule (current-search-statef agenda) 

else 

return reject 
else 

Push(Apply-Rules (current-search-state, grammar ), agenda) 
if agenda is empty then 
return reject 
else 

current-search-state «— XEXT(agenda) 

end 


Figure 10.6 A top-down, depth-first left-to-right parser. 


search-states. Each search-state consists of partial trees together with a 
pointer to the next input word in the sentence. 

The main loop of the parser takes a state from the front of the agenda 
and produces a new set of states by applying all the applicable grammar rules 
to the left-most unexpanded node of the tree associated with that state. This 
set of new states is then added to the front of the agenda in accordance with 
the textual order of the grammar rules that were used to generate them. This 
process continues until either a successful parse tree is found or the agenda 
is exhausted indicating that the input can not be parsed. 

Figure 10.7 shows the sequence of states examined by this algorithm 
in the course of parsing the following sentence. 

(10.2) Does this flight include a meal? 

In this figure, the node currently being expanded is shown in a box, while 
the current input word is bracketed. Words to the left of the bracketed word 
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have already been incorporated into the tree. 

The parser begins with a fruitless exploration of the S —> NP VP rule, 
which ultimately fails because the word Does cannot be derived from any 
of the parts-of-speech that can begin an NP. The parser thus eliminates the 
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S —> NP VP rule. The next search-state on the agenda corresponds to the 
S —> Aux NP VP rule. Once this state is found, the search continues in a 
straightforward depth-first, left to right fashion through the rest of the deriva¬ 
tion. 
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Adding Bottom-up Filtering 

Figure 10.7 shows an important qualitative aspect of the top-down parser. 

Beginning at the root of the parse tree, the parser expands non-terminal sym¬ 
bols along the left edge of the tree, down to the word at the bottom left 
edge of the tree. As soon as a word is incorporated into a tree, the input 
pointer moves on, and the parser will expand the new next left-most open 
non-terminal symbol down to the new left comer word. 

Thus in any successful parse the current input word must serve as the 
first word in the derivation of the unexpanded node that the parser is currently 
processing. This leads to an important consequence which will be useful in 
adding bottom-up filtering. The parser should not consider any grammar 
rule if the current input cannot serve as the first word along the left edge of 
some derivation from this rule. We call the first word along the left edge of 
a derivation the left-corner of the tree. left-corner 


YP 


VP 

/// ^ N 


/ / 

/ / // ^^ 


\ 

Verb Det 

Noun Noun Verb 

Det Noun 

Noun 

prefer a 

morning flight prefer 

a morning 

flight 

Figure 10.9 An illustration of the left-comer notion. The node Verb and 
the node prefer are both left-comers of VP. 


Consider the parse tree for a VP shown in Figure 10.9. If we visualize 
the parse tree for this VP as a triangle with the words along the bottom, 
the word prefer lies at the lower left-comer of the tree. Formally, we can 
say that for non-terminals A and B, B is a left-comer of A if the following 
relation holds: 

A => Ba 

In other words, B can be a left-comer of A if there is a derivation of A that 
begins with a B. 

We return to our example sentence Does this flight include a meal? 
The grammar in Figure 10.2 provides us with three rules that can be used to 
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expand the category S : 

S NPVP 

S AuxNPVP 

S VP 

Using the left-comer notion, it is easy to see that only the S —> Awe NP VP 
rule is a viable candidate since the word Does can not serve as the left-comer 
of either the NP or the VP required by the other two S rules. Knowing this, 
the parser should concentrate on the Aux NP VP rule, without first construct¬ 
ing and backtracking out of the others, as it did with the non-filtering exam¬ 
ple shown in Figure 10.7. 

The information needed to efficiently implement such a filter can be 
compiled in the form of a table that lists all the valid left-comer categories 
for each non-terminal in the grammar. When a rule is considered, the table 
entry for the category that starts the right hand side of the rule is consulted. If 
it fails to contain any of the parts-of-speech associated with the current input 
then the rule is eliminated from consideration. The following table shows 
the left-comer table for Grammar 10.2. 


Category 

Left Comers 

s 

NP 

Nominal 

VP 

Det, Proper-Noun, Aux, Verb 
Det, Proper-Noun 

Noun 

Verb 


Using this left-comer table as a filter in the parsing algorithm of Figure 10.6 
is left as Exercise 10.1 for the reader. 


10.3 Problems with the Basic Top-down Parser 


Even augmented with bottom-up filtering, the top-down parser in Figure 10.7 
has three problems that make it an insufficient solution to the general-purpose 
parsing problem. These three problems are left-recursion, ambiguity, and 
inefficient reparsing of subtrees. After exploring the nature of these three 
problems, we will introduce the Earley algorithm which is able to avoid all 
of them. 
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Left-Recursion 

Depth-first search has a well-known flaw when exploring an infinite search 

space: it may dive down an infinitely-deeper path and never return to visit 

the unexpanded states. This problem manifests itself in top-down, depth- 

first, left-to-right parsers when left-recursive grammars are used. Formally, recursive 

a grammar is left-recursive if it contains at least one non-terminal A, such 

that A ocAp, for some a and p and a => e. In other words, a grammar 

is left-recursive if it contains a non-terminal category that has a derivation 

that includes itself anywhere along its leftmost branch. The grammar of 

Chapter 9 had just such a left-recursive example, in the rules for possessive 

NPs like Atlanta's airport : 

NP —> Det Nominal 

Det -»• NP ' s 

These rules introduce left-recursion into the grammar since there is a deriva¬ 
tion for the first element of the NP, the Det , that has an NP as its first con¬ 
stituent. 

A more obvious and common case of left-recursion in natural language 
grammars involves immediately left-recursive rules. These are rules of the recursive 
form A — y A P, where the first constituent of the right hand side is identi¬ 
cal to the left hand side. The following are some of the immediately left- 
recursive rules that make frequent appearances in grammars of English. 

NP NPPP 

VP VPPP 

S SandS 

A left-recursive non-terminal can lead a top-down, depth-first left-to- 
right parser to recursively expand the same non-terminal over again in ex¬ 
actly the same way, leading to an infinite expansion of trees. 

Figure 10.10 shows the kind of expansion that accompanies the addi¬ 
tion of the NP —y NP PP rule as the first NP rule in our small grammar. 

There are two reasonable methods for dealing with left-recursion in a 
backtracking top-down parser: rewriting the grammar, and explicitly man¬ 
aging the depth of the search during parsing. Recall from Chapter 9, that 
it is often possible to rewrite the rules of a grammar into a weakly equiva¬ 
lent new grammar that still accepts exactly the same language as the origi¬ 
nal grammar. It is possible to eliminate left-recursion from certain common 
classes of grammars by rewriting a left-recursive grammar into a weakly 
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equivalent non-left-recursive one. The intuition is to rewrite each rule of the 
form A —> A (3 according to the following schema, using a new symbol A’: 


A A |3 | a 


A a A 9 

A’ |3A’| 8 


This transformation changes the left-recursion to a right-recursion, and changes 
the trees that result from these rules from left-branching structures to a right¬ 
branching ones. Unfortunately, rewriting grammars in this way has a major 
disadvantage: a rewritten phrase-structure rule may no longer be the most 
grammatically natural way to represent a particular syntactic structure. Fur¬ 
thermore, as we will see in Chapter 15, this rewriting may make semantic 
interpretation quite difficult. 


Ambiguity 

One morning I shot an elephant in my pajamas. How he got into 

my pajamas I don 7 know. 

Groucho Marx, Animal Crackers , 1930 

The second problem with the top-down parser of Figure 10.6 is that it is 
ambiguity not efficient at handling ambiguity. Chapter 8 introduced the idea of lexical 
category ambiguity (words which may have more than one part of speech) 
and disambiguation (choosing the correct part of speech for a word). 

In this section we introduce a new kind of ambiguity, which arises in 
the syntactic structures used in parsing, called structural ambiguity. Struc¬ 
tural ambiguity occurs when the grammar assigns more than one possible 
parse to a sentence. Groucho Marx’s well-known line as Captain Spaulding 
for the wavfile) is ambiguous because the phrase in my pajamas can be part 
of the NP headed by elephant or the verb-phrase headed by shot. 
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I shot an elephant in my pajamas I shot an elephant in my pajamas 


Figure 10.11 Two parse trees for an ambiguous sentence. Parse (a) corre¬ 
sponds to the humorous reading in which the elephant is in the pajamas, parse 
(b) to the reading in which Captain Spaulding did the shooting in his pajamas. 


Structural ambiguity, appropriately enough, comes in many forms. Three 
particularly common kinds of ambiguity are attachment ambiguity, coor¬ 
dination ambiguity, and noun-phrase bracketing ambiguity. 

A sentence has an attachment ambiguity if a particular constituent can 
be attached to the parse tree at more than one place. The Groucho Marx 
sentence above is an example of PP-attachment ambiguity. Various kinds 
of adverbial phrases are also subject to this kind of ambiguity. For example 
in the following example the gerundive-VP flying to New York can be part 
of a gerundive sentence whose subject is the Grand Canyon or it can be an 
adjunct modifying the VP headed by saw : 

(10.3) I saw the Grand Canyon flying to New York. 

In a similar kind of ambiguity, the sentence “Can you book TWA 
flights” is ambiguous between a reading meaning ‘Can you book flights 
on behalf of TWA’, and the other meaning ‘Can you book flights run by 
TWA’). Here either one NP is attached to another to form a complex NP 
(TWA flights ), or both NPs are distinct daughters of the verb phrase. Fig¬ 
ure 10.12 shows both parses. 

Another common kind of ambiguity is coordination ambiguity, in 
which there are different sets of phrases that can be conjoined by a conjunc¬ 
tion like and. For example, the phrase old men and women can be bracketed 
[old [men and women] 7, referring to old men and old women , or as [old men] 
and [women 7 , in which case it is only the men who are old. 
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Nom 
PNoun Noun 

can you book TWA flights can you book TWA flights 

Figure 10.12 Two parse trees for an ambiguous sentence. Parse (a) corre¬ 
sponds to the meaning ‘Can you book flights on behalf of TWA?’, parse (b) to 
‘Can you book flights which are run by TWA. 


These ambiguities all combine in complex ways. A program that sum¬ 
marized the news, for example, would need to be able to parse sentences like 
the following from the Brown corpus : 

(10.4) President Kennedy today pushed aside other White House business 
to devote all his time and attention to working on the Berlin crisis 
address he will deliver tomorrow night to the American people over 
nationwide television and radio. 

This sentence has a number of ambiguities, although since they are se¬ 
mantically unreasonable, it requires a careful reading to see them. The last 
noun phrase could be parsed [nationwide [television and radio]] or [[na¬ 
tionwide television] and radio]. The direct object of pushed aside should 
be other White House business but could also be the bizarre phrase [other 
White House business to devote all his time and attention to working] (i.e. 
a structure like Kennedy denied [his intention to propose a new budget to 
address the deficit]). Then the phrase on the Berlin crisis address he will 
deliver tomorrow night to the American people could be an adjunct modi¬ 
fying the verb pushed. The PP over nationwide television and radio could 
be attached to any of the higher VPs or NPs (for example it could modify 
people or night). 

The fact that there are many unreasonable parses for a sentence is an 
extremely irksome problem that affects all parsers. In practice, parsing a 
disambigue sentence thus requires disambiguation: choosing the correct parse from a 
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multitude of possible parsers. Disambiguation algorithms generally require 
both statistical and semantic knowledge, so they will be introduced later, in 
Chapter 12 and Chapter 17. 

Parsers which do not incorporate disambiguators must simply return 
all the possible parse trees for a given input. Since the top-down parser of 
Figure 10.7 only returns the first parse it finds, it would thus need to be 
modified to return all the possible parses. The algorithm would be changed 
to collect each parse as it is found and continue looking for more parses. 
When the search space has been exhausted, the list of all the trees found is 
returned. Subsequent processing or a human analyst can then decide which 
of the returned parses is correct. 

Unfortunately, we almost certainly do not want all possible parses from 
the robust, highly ambiguous, wide-coverage grammars used in practical ap¬ 
plications. The reason for this lies in the potentially exponential number of 
parses that are possible for certain inputs. Consider the ATIS example (10.5): 

(10.5) Show me the meal on Flight UA 386 from San Francisco to Denver. 



When our extremely small grammar is augmented with the recursive VP —> 
VP PP and NP NP PP rules introduced above, the three prepositional 
phrases at the end of this sentence conspire to yield a total of 14 parse trees 
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LOCAL 

AMBIGUITY 


for this sentence. For example from San Francisco could be part of the VP 
headed by show (which would have the bizarre interpretation that the show¬ 
ing was happening from San Francisco). 

Church and Patil (1982) showed that the number of parses for sen¬ 
tences of this type grows at the same rate as the number of parenthesiza- 
tions of arithmetic expressions. Such parenthesization problems, in turn, are 
known to grow exponentially in accordance with what are called the Catalan 
numbers: 


C(n) 


1 

n+ 1 


2 n\ 
n ) 


The following table shows the number of parses for a simple noun¬ 
phrase as a function of the number of trailing prepositional phrases. As can 
be seen, this kind of ambiguity can very quickly make it imprudent to keep 
every possible parse around. 


Number of 
PPs 

Number of 
NP Parses 

2 

2 

3 

5 

4 

14 

5 

132 

6 

469 

7 

1430 

8 

4867 


There are two basic ways out of this dilemma: using dynamic pro¬ 
gramming to exploit regularities in the search space so that common sub¬ 
parts are derived only once, thus reducing some of the costs associated with 
ambiguity, and augmenting the parser’s search strategy with heuristics that 
guide it toward likely parses first. The dynamic programming approach will 
be explored in the next section, while the heuristic search strategies will be 
covered in Chapter 12. 

Even if a sentence isn’t ambiguous, it can be inefficient to parse due 
to local ambiguity. Local ambiguity occurs when some part of a sentence 
is ambiguous, i.e. has more than parse, even if the whole sentence is not 
ambiguous. For example the sentence Book that flight is unambiguous, but 
when the parser sees the first word Book , it cannot know if it is a verb or 
a noun until later. Thus it must use backtracking or parallelism to consider 
both possible parses. 
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Repeated Parsing of Subtrees 

The ambiguity problem is related to another inefficiency of the top-down 
parser of Section 10.2. The parser often builds valid trees for portions of 
the input, then discards them during backtracking, only to find that it has to 
rebuild them again. Consider the process involved in finding a parse for the 
NP in (10.6): 

(10.6) a flight from Indianapolis to Houston on TWA 

The preferred parse, which is also the one found first by the parser presented 
in Section 10.2, is shown as the bottom tree in Figure 10.14. While there are 
5 distinct parses of this phrase, we will focus here on the ridiculous amount 
repeated work involved in retrieving this single parse. 

Because of the way the rules are consulted in our top-down, depth- 
first, left-to-right approach, the parser is led first to small parse trees that fail 
because they do not cover all of the input. These successive failures trigger 
backtracking events which lead to parses that incrementally cover more and 
more of the input. The sequence of trees attempted by our top-down parser 
is shown in Figure 10.14. 

This figure clearly illustrates the kind of silly reduplication of work 
that arises in backtracking approaches. Except for its topmost component, 
every part of the final tree is derived more than once. The following table 
shows the number of times that each of the major constituents in the final tree 
is derived. The work done on this example would, of course, be magnified 
by any backtracking caused by the verb phrase or sentential level. Note, that 
although this example is specific to top-down parsing, similar examples of 
wasted effort exist for bottom-up parsing as well. 


a flight 4 

from Indianapolis 3 

to Houston 2 

on TWA 1 

a flight from Indianapolis 3 

a flight from Indianapolis to Houston 2 


a flight from Indianapolis to Houston on TWA 1 
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Figure 10.14 Reduplicated effort caused by backtracking in top-down pars- 
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10.4 The Earley Algorithm 


The previous section presented three kinds of problems that afflict standard 
bottom-up or top-down parsers, even when they have been augmented with 
filtering and other improvements: left-recursive rules, ambiguity, and inef¬ 
ficient reparsing of subtrees. Luckily, there is a single class of algorithms 
which can solve all these problems. Dynamic programming once again prmiram- 
provides a framework for solving this problem, just as it helped us with the 
Minimum Edit Distance, Yiterbi, and Forward algorithms. Recall that dy¬ 
namic programming approaches systematically fill in tables of solutions to 
sub-problems. When complete, the tables contain the solution to all the sub¬ 
problems needed to solve the problem as a whole. In the case of parsing, 
such a table is used to store subtrees for each of the various constituents in 
the input as they are discovered. The efficiency gain arises from the fact that 
these subtrees are discovered once, stored, and then used in all parses calling 
for that constituent. This solves the reparsing problem (subtrees are looked 
up, not re-parsed) and the ambiguity problem (the parsing table implicitly 
stores all possible parses by storing all the constituents with links that enable 
the parses to be reconstructed). Furthermore, dynamic programming parsing 
algorithms also solve the problem of left-recursion. As we discussed ear¬ 
lier, there are three well-known dynamic programming parsers: the Cocke- 
Younger-Kasami (CYK) algorithm which we will present in Chapter 12, the 
Graham-Harrison-Ruzzo (GHR) (Graham et ah, 1980) algorithm and the 
Earley algorithm (Earley, 1970) which we will introduce in the remainder of 
this chapter. 

The Earley algorithm (Earley, 1970) uses a dynamic programming ap¬ 
proach to efficiently implement a parallel top-down search of the kind dis¬ 
cussed in Section 10.1. As with many dynamic programming solutions, this 
algorithm reduces an apparently exponential-time problem to a polynomial¬ 
time one by eliminating the repetitive solution of sub-problems inherent in 
backtracking approaches. In this case, the dynamic programming approach 
leads to a worst-case behavior of 0(N 3 ), where N is the number of words in 
the input. 

The core of the Earley algorithm is a single left-to-right pass that fills 
an array called a chart that has N + 1 entries. For each word position in chart 
the sentence, the chart contains a list of states representing the partial parse 
trees that have been generated so far. By the end of the sentence, the chart 
compactly encodes all the possible parses of the input. Each possible subtree 
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is represented only once and can thus be shared by all the parses that need it. 

The individual states contained within each chart entry contain three 
kinds of information: a subtree corresponding to a single grammar rule, in¬ 
formation about the progress made in completing this subtree, and the posi¬ 
tion of the subtree with respect to the input. Graphically, we will use a dot 
within the right hand side of a state’s grammar rule to indicate the progress 
dotted rule made in recognizing it. The resulting structure is called a dotted rule. A 
state’s position with respect to the input will be represented by two numbers 
indicating where the state begins and where its dot lies. Consider the follow¬ 
ing three example states, which would be among those created by the Earley 
algorithm in the course of parsing (10.7): 

(10.7) Book that flight, (same as (10.1).) 

S • VP, [0,0] 

NP Det•Nominal, [1,2] 

VP -► VNP*, [0,3] 

The first state, with its dot to the left of its constituent, represents a top- 
down prediction for this particular kind of S. The first 0 indicates that the 
constituent predicted by this state should begin at the start of the input; the 
second 0 reflects the fact that the dot lies at the beginning as well. The second 
state, created at a later stage in the processing of this sentence, indicates that 
an NP begins at position 1, that a Det has been successfully parsed and that 
a Nominal is expected next. The third state, with its dot to the right of all its 
two constituents, represents the successful discovery of a tree corresponding 
to a VP that spans the entire input. These states can also be represented 
graphically, in which the states of the parse are edges, or arcs, and the chart 
as a whole is a directed acyclic graph, as in Figure 10.15. 

The fundamental operation of an Earley parser is to march through the 
A+ 1 sets of states in the chart in a left-to-right fashion, processing the states 
within each set in order. At each step, one of the three operators described 
below is applied to each state depending on its status. In each case, this 
results in the addition of new states to the end of either the current or next set 
of states in the chart. The algorithm always moves forward through the chart 
making additions as it goes; states are never removed and the algorithm never 
backtracks to a previous chart entry once it has moved on. The presence of 
a state S —> a*, [0, A] in the list of states in the last chart entry indicates a 
successful parse. Figure 10.16 gives the complete algorithm. 
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The following three sections describe in detail the three operators used 
to process states in the chart. Each takes a single state as input and derives 
new states from it. These new states are then added to the chart as long 
as they are not already present. The Predictor and the Completer add 
states to the chart entry being processed, while the Scanner adds a state to 
the next chart entry. 

Predictor 

As might be guessed from its name, the job of the Predictor is to create 
new states representing top-down expectations generated during the parsing 
process. The Predictor is applied to any state that has a non-terminal to 
the right of the dot that is not a part-of-speech category. This application 
results in the creation of one new state for each alternative expansion of that 
non-terminal provided by the grammar. These new states are placed into the 
same chart entry as the generating state. They begin and end at the point in 
the input where the generating state ends. 

For example, applying the Predictor to the state S —> • VP, [0,0] 
results in adding the states VP —> • Verb, [0,0] and VP —> • Verb NP , [0,0] to 
the first chart entry. 

Scanner 

When a state has a part-of-speech category to the right of the dot, the Scan¬ 
ner is called to examine the input and incorporate a state corresponding 
to the predicted part-of-speech into the chart. This is accomplished by cre¬ 
ating a new state from the input state with the dot advanced over the pre¬ 
dicted input category. Note that the Earley parser thus uses top-down input 
to help disambiguate part-of-speech ambiguities; only those parts-of-speech 
of a word that are predicted by some state will find their way into the chart. 
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function Earley-Parse {words, grammar) returns chart 

Enqueue((y ->■ mS, [0,0]),chart[0]) 
for i<- from 0 to Length (wo rds) do 
for each state in chart{i\ do 
if Incomplete?^/*?) and 

NEXT-CAT(.state) is not a part of speech then 
PREDICTORCtfate) 
elseif Incomplete?^/*?) and 

NEXT-CAT(5/a/e) is a part of speech then 
SCANNERCsfate) 

else 

COMPLETER(>/a/*?) 

end 

end 

return {chart) 

procedure Predictor((A ->■ a • B (3, [ij])) 

for each (B ->• y) in Grammar-Rules-For(£, grammar) do 
Enqueue((B -4 • Y, [j,j]),chart\j]) 

end 

procedure Sc anner((A ->■ a • B (3, [/,;])) 
if B c PARTS-OF-SPEECH(word//7) then 

Enqueue((£ ->* word[j\, [/, j + 1 ]), chart[j+1]) 

procedure Completer((£ ->■ y •, [j\k])) 
for each (A — > a • B (3, [ij]) in chart\J] do 
Enqueue((A -► a5#3, [i,k]%chart[k ]) 

end 

procedure Enqueue (state, chart-entry) 
if state is not already in chart-entry then 
Push (state, chart-entry) 

end 


Figure 10.16 The Earley algorithm. 


Returning to our example, when the state VP —> •Verb NP , [0,0] is pro¬ 
cessed, the Scanner consults the current word in the input since the cat¬ 
egory following the dot is a part-of-speech. The Scanner then notes that 
book can be a verb, matching the expectation in the current state. This results 
in the creation of the new state VP —> VeiiyNP, [0,1]. The new state is then 
added to the chart entry that follows the one currently being processed. 
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Completer 

The Completer is applied to a state when its dot has reached the right 
end of the rule. Intuitively, the presence of such a state represents the fact 
that the parser has successfully discovered a particular grammatical category 
over some span of the input. The purpose of the Completer is to find and 
advance all previously created states that were looking for this grammatical 
category at this position in the input. New states are then created by copying 
the older state, advancing the dot over the expected category and installing 
the new state in the current chart entry. 

For example, when the state NP DetNominal*, [1,3] is processed, 
the Completer looks for states ending at 1 expecting an NP. In the current 
example, it will find the state VP Verb*NP , [0,1] created by the Scanner. 
This results in the addition of a new complete state VP —> Verb NP •, [0,3]. 

An Example 

Figure 10.17 shows the sequence of states created during the complete pro¬ 
cessing of Example 10.1/10.7. The algorithm begins by seeding the chart 
with a top-down expectation for an S. This is accomplished by adding a 
dummy state y —> • S, [0,0] to Chart[0]. When this state is processed, it is 
passed to the Predictor leading to the creation of the three states repre¬ 
senting predictions for each possible type of S', and transitively to states for 
all of the left comers of those trees. When the state VP —> • Verb, [0,0] is 
processed, the Scanner is called and the first word is consulted. A state 
representing the verb sense of Book is then added to the entry for Chart[l]. 
Note that when the state VP —> *VNP , [0,0] is processed, the Scanner is 
called again. However, this time a new state is not added since it would be 
identical to the one already in the chart. Note also that since this admittedly 
deficient grammar generates no predictions for the Noun sense of Book , no 
entries will be made for it in the chart. 

When all the states of Chart[0] have been processed, the algorithm 
moves on to Chart[l] where it finds the state representing the verb sense 
of book. This is a complete state with its dot to the right of its constituent 
and is therefore passed to the Completer. The Completer then finds 
the two previously existing VP states expecting a Verb at this point in the 
input. These states are copied with their dots advanced and added to the 
Chart[l]. The completed state corresponding to an intransitive VP leads to 
the creation of the imperative S state. Alternatively, the dot in the transitive 
verb phrase leads to the creation of the two states predicting NPs. Finally, 
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Chart[0] 


y ->■ *S 

[0,0] Dummy start state 

S *NPVP 

[0,0] 

Predictor 

NP *Det NOMINAL 

[0,0] 

Predictor 

NP —> * Proper-Noun 

[0,0] 

Predictor 

S • Aux NP VP 

[0,0] 

Predictor 

S -> *VP 

[0,0] 

Predictor 

VP * Verb 

[0,0] 

Predictor 

VP * Verb NP 

[0,0] 

Predictor 


Chart[l] 

Verb —> book* 

[0,1] Scanner 

VP -> 

Verb* 

[0,1] Completer 

S VP* 

[0,1] Completer 

VP 

Verb * NP 

[0,1] Completer 

NP -> 

* Det NOMINAL 

[1,1] Predictor 

NP 

* Proper-Noun 

[1,1] Predictor 


Chart[2] 

Det —^ that* 

[1,2] 

Scanner 

NP Det*NOMINAL 

[1,2] Completer 

NOMINAL * Noun 

[2,2] 

Predictor 

NOMINAL * Noun NOMINAL [2,2] 

Predictor 

Chart[3] 




Noun —> flight* 

NOMINAL Noum 
NOMINAL -> Noum NOMINAL 
NP -> DetNOMINAL* 

VP Verb NP * 

S VP* 

NOMINAL * Noun 
NOMINAL * Noun NOMINAL 


[2,3] Scanner 
[2,3] Completer 

[2.3] Completer 

[1.3] Completer 
[0,3] Completer 
[0,3] Completer 

[3.3] Predictor 
[3,3] Predictor 


Figure 10.17 Sequence of states created in chart while parsing Book that 
flight. Each entry shows the state, its start and end points, and the Earley 
function that placed it in the chart. 
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the state NP • Det Nominal, [1,1] causes the Scanner to consult the word 
that and add a corresponding state to Chart[2]. 

Moving on to Chart[2j, the algorithm finds the state representing the 
determiner sense of that. This complete state leads to the advancement of 
the dot in the NP state predicted in Chart[l], and also to the predictions for 
the various kinds of Nominal. The first of these causes the Scanner to be 
called for the last time to process the word flight. 

Moving on to Chart[3], the presence of the state representing flight 
leads in quick succession to the completion of an NP , transitive VP , and an 
S. The presence of the state S —> VPm, [0,3] in the last chart entry signals the 
discovery of a successful parse. 


Retrieving Parse Trees from a Chart 


The version of the Earley algorithm just described is actually a recognizer not 
a parser. After processing, valid sentences will leave the state S —> a*, [0,A] 
in the chart. Unfortunately, as it stands we have no way of retrieving the 
structure of this S. To turn this algorithm into a parser, we must be able to 
extract individual parses from the chart. To do this, the representation of 
each state must be augmented with an additional field to store information 
about the completed states that generated its constituents. 

This information can be gathered by making a simple change to the 
Completer. Recall that the Completer creates new states by advancing 
older incomplete ones when the constituent following the dot is discovered. 
The only change necessary is to have Completer add a pointer to the older 
state onto the list of previous-states of the new state. Retrieving a parse tree 
from the chart is then merely a recursive retrieval starting with the state (or 
states) representing a complete S in the final chart entry. Figure 10.18 shows 
the chart produced by an appropriately updated Completer. 

If there are an exponential number of trees for a given sentence, the 
Earley algorithm can not magically return them all in a polynomial amount 
of time. The best it can do is build the chart in polynomial time. Figure 10.19 
illustrates a portion of the chart from Figure 10.17 using the directed graph 
notation. Note that since large charts in this format can get rather confusing, 
this figure only includes the states that play a role in the final parse. 
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Chart[0] 


SO 

y • S 

[0,0] 

[] 

Dummy start state 

SI 

S ^ • NP VP 

[0,0] 

[] 

Predictor 

S2 

NP •DetNOMINAL 

[0,0] 

[] 

Predictor 

S3 

NP —> • Proper-Noun 

[0,0] 

[] 

Predictor 

S4 

S -)• •AuxNPVP 

[0,0] 

[] 

Predictor 

S5 

S ^ •VP 

[0,0] 

[] 

Predictor 

S6 

VP -> • Verb 

[0,0] 

[] 

Predictor 

S7 

VP -y • Verb NP 

[0,0] 

[] 

Predictor 


Chart[l] 


58 Verb —Y boolf 

59 VP Verb• 

510 S -> VP• 

511 VP Verb • NP 


[0,1] [] Scanner 

[0,1] [S8] Completer 
[0,1] [S9] Completer 
[0,1] [S8] Completer 


512 NP • Det NOMINAL [1,1] [] Predictor 

513 NP —> •Proper-Noun [1,1] [] Predictor 


_ Chart[2] _ 

514 Det ->■ that • [1,2] [] Scanner 

515 NP -> Det •NOMINAL [1,2] [S14] Completer 

516 NOMINAL •Noun [2,2] [] Predictor 

517 NOMINAL •NounNOMINAL [2,2] [] Predictor 


Chart[3] 


S18 

Noun —> flight 

[2,3] 

[] 

Scanner 

S19 

NOMINAL Noum 

[2,3] 

[SI 8] 

Completer 

S20 

NOMINAL Noum NOMINAL 

[2,3] 

[S18] 

Completer 

S21 

NP Det NOMINAL • 

[1,3] 

[S14.S19] Completer 

S22 

VP ->- Verb NP • 

[0,3] 

[S8,S21] 

Completer 

S23 

S VP • 

[0,3] 

[S22] 

Completer 

S24 

NOMINAL • Noun 

[3,3] 

[] 

Predictor 

S25 

NOMINAL •NounNOMINAL 

[3,3] 

[] 

Predictor 


Figure 10.18 Sequence of states created in chart while parsing Book that 
flight including structural information. 
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S23 



Figure 10.19 A portion of the chart shown in Figure 10.17 displayed in a 
directed acyclic graph notation. 


10.5 Finite-State Parsing Methods 


Some language-processing tasks don’t require complete parses. For these 
tasks, a partial parse or shallow parse of the input sentence may be suf¬ 
ficient. For example, information extraction algorithms generally do not 
extract all the possible information in a text; they simply extract enough to 
fill out some sort of template of required data. Many partial parsing sys¬ 
tems use cascades of finite-state automata instead of context-free grammars. 
Relying on simple finite-state automata rather than full parsing makes such 
systems extremely efficient. Since finite-state systems cannot model certain 
kinds of recursive rules, however, they trade this efficiency for a certain lack 
of coverage. We will discuss information extraction in Chapter 15; here we 
just show how finite-state automata can be used to recognize basic phrases, 
such as noun groups, verb groups, locations, etc. Here’s the output of the 
FASTUS basic phrase identifier; of course the choice of which basic phrases 
to produce can be dependent on the application: 


SHALLOW 

PARSE 


CASCADES 


BASIC 

PHRASES 
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Company Name: 

Bridgestone Sports Co. 

Verb Group: 

said 

Noun Group: 

Friday 

Noun Group: 

it 

Verb Group: 

had set up 

Noun Group: 

a joint venture 

Preposition: 

in 

Location: 

Taiwan 

Preposition: 

with 

Noun Group: 

a local concern 

Conjunction: 

and 

Noun Group: 

a Japanese trading house 

Verb Group: 

to produce 

Noun Group: 

golf clubs 

Verb Group: 

to be shipped 

Preposition: 

to 

Location: 

Japan 


These basic phrases are produced by a collection finite-state rules com¬ 
piled into a transducer. To give a feel for how this works, we’ll give a simpli¬ 
fied set of the FASTUS rules from Appelt and Israel (1997) used to build the 
automaton to detect noun groups. A noun group is like the core of a noun 
phrase; it consists of the head noun and the modifiers to the left (determiner, 
adjectives, quantifiers, numbers, etc). 

A noun-group can consist of just a pronoun she , him , them or a time- 
phrase yesterday , or a date: 

NG —> Pronoun | Time-NP | Date-NP 

It can also consist of certain determiners that can stand alone (this, 
that); or a head noun (HdNns) preceded by optional determiner phrase ( DETP ) 
and/or optional adjectives (Adjs) (the quick and dirty solution , the frustrat¬ 
ing mathematics problem) or a head noun modified by a gerund phrase (the 
rising index): 

NG -> (DETP) (Adjs) HdNns | DETP Ving HdNns | DETP-CP (and HdNns) 

The parentheses above are used to indicate optional elements, while 
braces are used just for grouping. Determiner-phrases come in two varieties: 

DETP DETP-CP | DETP-INCP 

Complete determiner-phrases (DETP-CP) are those which can stand 
alone as an NP, such as the only five , another three , this , many , hers , all , 
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and the most. Adv-pre-num are adverbs that can appear before a number in 
the determiner (,almost 5, precisely 5), while Pro-Poss-cp are possessive pro¬ 
nouns that can stand on their own as complete NPs (mine, his). Quantifiers 
(Q) include many, few, much, etc. 

DETP-CP —v ( { Adv-pre-num | “another” | 

{ Det | Pro-Poss } ({Adv-pre-num | “only” (“other”)})}) Number 
| Q | Q-er | (“the”) Q-est | “another” | Det-cp | DetQ | Pro-Poss-cp 

Incomplete determiner-phrases ( DETP-INCP ) are those which cannot 
act as NPs alone, for example the, his only, every, a. Pro-Poss-incomp are 
possessive pronouns which cannot stand on their own as a complete NP (e.g. 
my, her): 

DETP-INCP -A { { { Det | Pro-Poss } “only” 

| “a” | “an” 

| Det-incomp 

| Pro-Poss-incomp } (“other”) 

| (DET-CP) “other”} 

An adjective sequence (Adjs) consists of one or more adjectives or par¬ 
ticiples separated by commas and/or conjunctions (e.g. big, bad , and ugly, 
or interesting but outdated ): 

Adjs -A AdjP ( { “,” | (“ ”) Conj } { AdjP | Vparticiple}) * 

Adjective phrases can be made of adjectives, participles, ordinal num¬ 
bers, and noun-verb combinations, like man-eating, and can be modified 
by comparative and superlative quantifiers (Q-er: more, fewer; Q-est: most, 
fewest). This rule-set chooses to disallow participles as the first word in 
adjective-phrases or noun groups, to avoid incorrectly taking many Verb- 
Object combinations as noun groups. 

AdjP —> Ordinal 

| ({Q-er | Q-est} { Adj | Vparticiple } + 
j { N[sing,!Time-NP] (“-”) { Vparticiple } 
j Number (“-”) { “month” | “day” | “year”} (“-”) “old”} 

Nouns can be conjoined ( cats and dogs): 

HdNns -A HdNn (“and” HdNn) 

Finally, we need to deal with noun-noun compounds and other noun¬ 
like pre-modifiers of nouns, in order to cover head noun groups like gasoline 
and oil tanks, California wines, Clinton, and quick-reaction strike: 
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HdNn —> PropN 

| { PreNs | PropN PreNs} N[!Time-NP] 
j { PropN CommonN[!Time-NP] } 

Noun modifiers of nouns can be conjoined (gasoline and oil ) or created 
via dash (quick-reaction). Adj-noun-like refers to adjectives that can appear 
in the position of a prenominal noun (e.g. presidential retreat): 

PreNs —> PreN (“and” PreN2) * 
preN ->■ (Adj Common-Sing-N 
preN2 —> PreN | Ordinal | Adj-noun-like 

Figure 10.20 shows an FSA for the Adjs portion of the noun-group 
recognizer, and an FSA for the AdjP portion. 


8 



Figure 10.20 A portion of an FSA grammar, covering conjoined adjective 
phrases. In a real automaton, each AdjP node would actually be expanded with 
a copy of iht AdjP automaton shown in Figure 10.21 





Figure 10.21 A portion of an FSA grammar, covering the internal details 
of adjective phrases. 


The pieces of automata in Figure 10.20 and Figure 10.21 can then 
be combined into a single large Noun-Group-Recognizer by starting with 
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the NG automaton and iteratively expanding out each reference to another 
rule/automaton. This is only possible because none of these references are 
recursive; that is, because the expansion of AdjP doesn’t refer to AdjP. 

Page 345, however, showed that a more complete grammar of English 
requires this kind of recursion. Recall, for example, that a complete defini¬ 
tion of NP needs to refer to other NPs in the rules for relative clauses and 
other post-nominal modifiers. 

One way to handle recursion is by allowing only a limited amount 
of recursion; this is what FASTUS does, by using its automata cascade. 

The second level of FASTUS finds non-recursive noun groups; the third 
level combines these groups into larger AP-like units by adding on measure 
phrases: 

20,000 iron and “metal wood” clubs a month, 
attaching preposition phrases: 

production of 20,000 iron and “metal wood” clubs a month, 
and dealing with noun group conjunction: 

a local concern and a Japanese trading house 

In a single level system, each of these phenomena would require recur¬ 
sive rules (e.g. NP —> NP and NP). By splitting the parsing into two levels, 
FASTUS essentially treats the NP on the left-hand side as a different kind of 
object from the NPs on the right-hand side. 

A second method for dealing with recursion is to use a model which 
looks finite-state but isn’t. One such model is the Recursive Transition 
Network or RTN. An RTN is defined by a set of graphs like those in Fig- rtn 
ure 10.20 and Figure 10.21, in which each arc contains a terminal or non¬ 
terminal node. The difference between an RTN and an FSA lies in how the 
non-terminals are handled. In an RTN, every time the machine comes to an 
arc labeled with a non-terminal, it treats that non-terminal as a sub-routine. 

It places its current location onto a stack, jumps to the non-terminal, and 
then jumps back when that non-terminal has been parsed. If a rule for NP 
contains a self-reference, the RTN once again puts the current location on a 
stack and jumps back to the beginning of the NP. 

Since an RTN is exactly equivalent to a context-free grammar, travers¬ 
ing an RTN can thus be thought of as a graphical way to view a simple 
top-down parser for context-free rules. RTNs are most often used as a con¬ 
venient graphical metaphor when displaying or describing grammars, or as 
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a way to implement a system which has a small amount of recursion but is 
otherwise finite-state. 


10.6 Summary 

This chapter introduced a lot of material. The most important two ideas are 
those of parsing and partial parsing. Here’s a summary of the main points 
we covered about these ideas: 

• Parsing can be viewed as a search problem. 

• Two common architectural metaphors for this search are top-down 
(starting with the root S and growing trees down to the input words) 
and bottom-up (staring with the words and growing trees up toward 
the root S ). 

• One simple parsing algorithm is the top-down depth-first left-to-right 
parser of Figure 10.6 on page 362. 

• Top down parsers can be made more efficient by using a left-comer 
table to only suggest non-terminals which are compatible with the in¬ 
put. 

• Ambiguity, left-recursion, and repeated parsing of sub-trees all pose 
problems for this simple parsing algorithm. 

• A sentence is structurally ambiguous if the grammar assigns it more 
than one possible parse. 

• Common kinds of structural ambiguity include PP-attachment, coor¬ 
dination ambiguity and noun-phrase bracketing ambiguity. 

• The dynamic programming parsing algorithms use a table of partial- 
parses to efficiently parse ambiguous sentences. The Earley algorithm 
is a top-down dynamic-programming algorithm, while the CYK algo¬ 
rithm is bottom up. 

• Certain information extraction problems can be solved without full 
parsing. These are often addressed via FSA cascades. 


Bibliographical and Historical Notes 


Writing about the history of compilers, Knuth notes: 
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In this field there has been an unusual amount of parallel discov¬ 
ery of the same technique by people working independently. 

Indeed, the problem of identifying the first appearance of various parsing 
ideas recalls Kruskal’s (1983) comment about the ‘remarkable history of 
multiple independent discovery and publication’ of dynamic programming 
algorithms for sequence comparison. This history will therefore error on 
the side of succinctness in giving only a characteristic early mention of each 
algorithm; the interested reader should see Aho and Ullman (1972). 

Bottom-up parsing seems to have been first described by Yngve (1955), 
who gave a breadth-first bottom-up parsing algorithm as part of an illustra¬ 
tion of a machine translation procedure. Top-down approaches to parsing 
and translation was described (presumably independently) by at least Glen¬ 
nie (1960), Irons (1961), and Kuno and Oettinger (1962). Dynamic pro¬ 
gramming parsing, once again, has a history of independent discovery. Ac¬ 
cording to Martin Kay (p.c.), a dynamic programming parser containing the 
roots of the CYK algorithm was first implemented by John Cocke in 1960. 

Later work extended and formalized the algorithm, as well as proving its 
time complexity (Kay, 1967; Younger, 1967; Kasami, 1965). The related 
well-formed substring table (WFST) seems to have been independently wfst 
proposed by Kuno (1965), as a data structure which stores the results of all 
previous computations in the course of the parse. Based on a generalization 
of Cocke’s work, a similar data-structure had been independently described 
by Kay (1967) and Kay (1973). The top-down application of dynamic pro¬ 
gramming to parsing was described in Earley’s Ph.D. thesis (Earley, 1968) 
and Earley (1970). Sheil (1976) showed the equivalence of the WFST and 
the Earley algorithm. Norvig (1991) shows that the efficiency offered by 
all of these dynamic programming algorithms can be captured in any lan¬ 
guage with a memoization function (such as LISP) simply by wrapping the 
memoization operation around a simple top-down parser. 

While parsing via cascades of finite-state automata had been com¬ 
mon in the early history of parsing (Harris, 1962), the focus shifted to full 
CFG parsing quite soon afterwards. Church (1980) argued for a return to 
finite-state grammars as a processing model for natural language understand¬ 
ing; Other early finite-state parsing models include Ejerhed (1988). Abney 
(1991) argued for the important practical role of shallow parsing. Much re¬ 
cent work on shallow parsing applies machine learning to the task of learning 
the patterns; see for example Ramshaw and Marcus (1995), Shlomo Arga- 
mon (1998), and Munoz et al (1999). 
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The classic reference for parsing algorithms is Aho and Ullman (1972); 
although the focus of that book is on computer languages, most of the algo¬ 
rithms have been applied to natural language. A good programming lan¬ 
guages textbook such as Aho et al (1986) is also useful. 


Exercises 

10.1 Modify the top-down parser in Figure 10.7 to add bottom-up filtering. 
You can assume the use of a left-comer table like the one on page 366. 

10.2 Write an algorithm for eliminating left-recursion based on the intu¬ 
ition on page 368. 

10.3 Implement the finite-state grammar for noun-groups described on pages 
384-387. Test it on some sample noun-phrases. If you have access to an on¬ 
line dictionary with part-of-speech information, start with that; if not, build 
a more restricted system by hand. 

10.4 Augment the Earley algorithm of Figure 10.16 to enable parse trees 
to be retrieved from the chart by modifying the pseudocode for the Com¬ 
pleter as described on page 381. 

10.5 Implement the Earley algorithm as augmented in the previous ex¬ 
ercise. of Figure 10.16. Check it on a test sentence using a baby gram¬ 
mar. 


10.6 Discuss the relative advantages and disadvantages of partial parsing 
versus full parsing. 

10.7 Discuss how you would augment a parser to deal with input that may 
be be incorrect, such as spelling errors or misrecognitions from a speech 
recognition system. 




FEATURES AND 
UNIFICATION 


Friar Francis: If either of you know any inward impediment 
why you should not be conjoinedcharge you , on your souls , 
to utter it 


William Shakespeare, Much Ado About Nothing 


From a reductionist perspective, the history of the natural sciences over the 
last few hundred years can be seen as an attempt to explain the behavior of 
larger structures by the combined action of smaller primitives. In biology, 
the properties of inheritance have been explained by the action of genes, 
and then again the properties of genes have been explained by the action of 
DNA. In physics, matter was reduced to atoms and then again to subatomic 
particles. The appeal of reductionism has not escaped computational lin¬ 
guistics. In this chapter we introduce the idea that grammatical categories 
like VPto, Sthat , Non3sgAux , or 3sgNP , as well as the grammatical rules like 
S —> NP VP that make use of them, should be thought of as objects that can 
have complex sets of properties associated with them. The information in 
these properties is represented by constraints, and so these kinds of models 
are often called constraint-based formalisms. 

Why do we need a more fine-grained way of representing and plac¬ 
ing constraints on grammatical categories? One problem arose in Chapter 9, 
where we saw that naive models of grammatical phenomena such as agree¬ 
ment and subcategorization can lead to overgeneration problems. For exam¬ 
ple, in order to avoid ungrammatical noun phrases such as this flights and 
verb phrases like disappeared a flight , we were forced to create a huge pro¬ 
liferation of primitive grammatical categories such as Non3sgVPto , NPmass , 


CONSTRAINT- 
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3sgNP and Non3sgAux. These new categories led, in turn, to an explosion in 
the number of grammar rules and a corresponding loss of generality in the 
grammar. A constraint-based representation scheme will allow us to repre¬ 
sent fine-grained information about number and person, agreement, subcat¬ 
egorization, as well as semantic categories like mass/count. 

Constraint-based formalisms have other advantages that we will not 
cover in this chapter, such as the ability to model more complex phenomena 
than context-free grammars, and the ability to efficiently and conveniently 
compute semantics for syntactic representations. 

Consider briefly how this approach might work in the case of grammat¬ 
ical number. As we saw in Chapter 9, noun phrases like this flight and those 
flights can be distinguished based on whether they are singular or plural. 
This distinction can be captured if we associate a property called number 
that can have the value singular or plural, with appropriate members of the 
NP category. Given this ability, we can say that this flight is a member of the 
NP category and, in addition, has the value singular for its number prop¬ 
erty. This same property can be used in the same way to distinguish singular 
and plural members of the VP category such as serves lunch and serve lunch. 

Of course, simply associating these properties with various words and 
phrases does not solve any of our overgeneration problems. To make these 
properties useful, we need the ability to perform simple operations, such as 
equality tests, on them. By pairing such tests with our core grammar rules, 
we can add various constraints to help ensure that only grammatical strings 
are generated by the grammar. For example, we might want to ask whether 
or not a given noun phrase and verb phrase have the same values for their 
respective number properties. Such a test is illustrated by the following kind 
of rule. 

S ^ NP VP 

Only if the number of the NP is equal to the number of the YP. 

The remainder of this chapter provides the details of one computational 
implementation of a constraint-based formalism, based on feature struc¬ 
tures and unification. The next section describes feature structures, the 
representation used to capture the kind of grammatical properties we have in 
mind. Section 11.2 then introduces the unification operator that is used to 
implement basic operations over feature structures. Section 11.3 then cov¬ 
ers the integration of these structures into a grammatical formalism. Section 
11.4 then introduces the unification algorithm and its required data struc¬ 
tures. Next, Section 11.5 describes how feature structures and the unifica- 
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tion operator can be integrated into a parser. Finally, Section 11.6 discusses 
the most significant extension to this constraint-based formalism, the use of 
types and inheritance, as well as other extensions. 

11.1 Feature Structures 

One of the simplest ways to encode the kind of properties that we have in 
mind is through the use of feature structures. These are simply sets of 
feature-value pairs, where features are unanalyzable atomic symbols drawn 
from some finite set, and values are either atomic symbols or feature struc¬ 
tures. Such feature structures are traditionally illustrated with the following 
kind of matrix-like diagram. 


FEATUREi 

VALUEi 

feature 2 

value 2 

FEATURE^ 

VALUE ra 


To be concrete, let us consider the number property discussed above. 
To capture this property, we will use the symbol number to designate this 
grammatical attribute, and the symbols SG and pl (introduced in Chapter 3) 
to designate the possible values it can take on in English. A simple feature 
structure consisting of this single feature would then be illustrated as follows. 

NUMBER SG 

Adding an additional feature-value pair to capture the grammatical notion of 
person leads to the following feature structure. 

NUMBER SG 
PERSON 3 

Next we can encode the grammatical category of the constituent that this 
structure corresponds to through the use of the cat feature. For example, 
we can indicate that these features are associated with a noun phrase by 
using the following structure. 

CAT NP 

NUMBER SG 
PERSON 3 


FEATURE 

STRUCTURES 
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FEATURE 

PATH 


This structure can be used to represent the 3sgNP category introduced in 
Chapter 9 to capture a restricted subcategory of noun phrases. The corre¬ 
sponding plural version of this structure would be captured as follows. 

CAT NP 

NUMBER PL 
PERSON 3 


Note that the value of the cat and person features remains the same for 
these last two structures. This illustrates how the use of feature structures 
allows us to both preserve the core set of grammatical categories and draw 
distinctions among members of a single category. 

As mentioned earlier in the definition of feature structures, features 
are not limited to atomic symbols as their values; they can also have other 
feature structures as their values. This is particularly useful when we wish 
to bundle a set of feature-value pairs together for similar treatment. As an 
example of this, consider that the number and person features are often 
lumped together since grammatical subjects must agree with their predicates 
in both their number and person properties. This lumping together can be 
captured by introducing an agreement feature that takes a feature struc¬ 
ture consisting of the number and person feature-value pairs as its value. 
Introducing this feature into our third person singular noun phrase yields the 
following kind of structure. 


CAT NP 


AGREEMENT 


NUMBER SG 
PERSON 3 


Given this kind of arrangement, we can test for the equality of the values for 
both the number and person features of two constituents by testing for 
the equality of their agreement features. 

This ability to use feature structures as values leads fairly directly to 
the notion of a feature path. A feature path is nothing more than a list of 
features through a feature structure leading to a particular value. For exam¬ 
ple, in the last feature structure, we can say that the (agreement number) 
path leads to the value SG, while the (agreement person) path leads to 
the value 3. This notion of a path leads naturally to an alternative graph¬ 
ical way of illustrating features structures, shown in Figure 11.1, which as 
we will see in Section 11.4 is suggestive of how they will be implemented. 
In these diagrams, feature structures are depicted as directed graphs where 
features appear as labeled edges and values as nodes. 
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Figure 11.1 A directed graph notation for feature structures. 


Although this notion of paths will prove useful in a number of set¬ 
tings, we introduce it here to help explain an additional important kind of 
feature structure: those that contain features that actually share some feature 
structure as a value. Such feature structures will be referred to as reentrant reentrant 
structures. What we have in mind here is not the simple idea that two fea¬ 
tures might have equal values, but rather that they share precisely the same 
feature structure (or node in the graph). These two cases can be distinguished 
clearly if we think in terms of paths through a graph. In the case of simple 
equality, two paths lead to distinct nodes in the graph that anchor identical, 
but distinct structures. In the case of a reentrant structure, two feature paths 
actually lead to the same node in the structure. 

Figure 11.2 illustrates a simple example of reentrancy. In this structure, 
the (head subject agreement) path and the (head agreement) path 
lead to the same location. Shared structures like this will be denoted in our 
matrix diagrams by adding numerical indexes that signal the values to be 
shared. The matrix version of the feature structure from Figure 11.2 would 
be denoted as follows, using the notation of the PATR-II system (Shieber, 

1986), based on Kay (1979): 


CAT S 


HEAD 


AGREEMENT 

SUBJECT 


NUMBER SG 

0 

PERSON 3 
AGREEMENT E 
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Figure 11.2 A feature structure with shared values. The location (value) 
found by following the (head subject agreement) path is the same as 
that found via the (head agreement) path. 


As we will see, these simple structures give us the ability to express 
linguistic generalizations in surprisingly compact and elegant ways. 


1 1.2 Unification of Feature Structures 


As noted earlier, feature structures would be of little use without our being 
able to perform reasonably efficient and powerful operations on them. As we 
will show, the two principal operations we need to perform are merging the 
information content of two structures and rejecting the merger of structures 
that are incompatible. Fortunately, a single computational technique, called 
unification unification, suffices for both of these purposes. The bulk of this section 
will illustrate through a series of examples how unification instantiates these 
notions of merger and compatibility. Discussion of the unification algorithm 
and its implementation will be deferred to Section 11.4. 

We begin with the following simple application of the unification op¬ 
erator. 


number 


sg 


U 


NUMBER 


SG 


NUMBER 


SG 


As this equation illustrates, unification is implemented as a binary operator 
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(represented here as U) that accepts two feature structures as arguments and 
returns a feature structure when it succeeds. In this example, unification is 
being used to perform a simple equality check. The unification succeeds 
because the corresponding number features in each structure agree as to 
their values. In this case, since the original structures are identical, the output 
is the same as the input. The following similar kind of check fails since the 
NUMBER features in the two structures have incompatible values. 


NUMBER 


SG 


u 


NUMBER 


PL 


Fails! 


This next unification illustrates an important aspect of the notion of 
compatibility in unification. 


NUMBER 


SG 


u 


NUMBER 


[] 


NUMBER SG 


In this situation, these features structures are taken to be compatible, and 
are hence capable of being merged, despite the fact that the given values for 
the respective number features are different. The [] value in the second 
structure indicates that the value has been left unspecified. A feature with 
such a [] value can be successfully matched to any value in a corresponding 
feature in another structure. Therefore, in this case, the value SG from the 
first structure can match the [] value from the second, and as is indicated by 
the output shown, the result of this type of unification is a structure with the 
value provided by the more specific, non-null, value. 

The next example illustrates another of the merger aspects of unifica¬ 
tion. 


NUMBER SG 

u 

PERSON 3 

= 

NUMBER 

SG 





PERSON 

3 


Here the result of the unification is a merger of the original two structures 
into one larger structure. This larger structure contains the union of all the 
information stored in each of the original structures. Although this is a sim¬ 
ple example, it is important to understand why these structures are judged to 
be compatible: they are compatible because they contain no features that are 
explicitly incompatible. The fact that they each contain a feature-value pair 
that the other does not is not a reason for the unification to fail. 

We will now consider a series of cases involving the unification of 
somewhat more complex reentrant structures. The following example illus¬ 
trates an equality check complicated by the presence of a reentrant structure 
in the first argument. 
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AGREEMENT 


NUMBER SG 
PERSON 3 


SUBJECT 


AGREEMENT CD 


U 


SUBJECT 


AGREEMENT 


PERSON 

NUMBER 


3 

SG 


AGREEMENT 


a 


NUMBER 

PERSON 


SG 

3 


SUBJECT 


AGREEMENT 0 


The important elements in this example are the subject features in the two 
input structures. The unification of these features succeeds because the val¬ 
ues found in the first argument by following the CD numerical index, match 
those that are directly present in the second argument. Note that, by itself, 
the value of the agreement feature in the first argument would have no 
bearing on the success of unification since the second argument lacks an 
agreement feature at the top level. It only becomes relevant because the 
value of the agreement feature is shared with the subject feature. 

The following example illustrates the copying capabilities of unifica¬ 
tion. 


( 11 . 1 ) 


AGREEMENT 


a 


SUBJECT 


AGREEMENT Q] 


U 


SUBJECT 


AGREEMENT 


PERSON 3 
NUMBER SG 


AGREEMENT Q] 


SUBJECT 


AGREEMENT 


a 


PERSON 

NUMBER 


3 

SG 


Here the value found via the second argument’s (subject agreement) 
feature is copied over to the corresponding place in the first argument. In 
addition, the AGREEMENT feature of the first argument receives a value as 
a side-effect of the index linking it to the end of (subject agreement) 
feature. 
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The next example demonstrates the important difference between fea¬ 
tures that actually share values versus those that merely have similar values. 

( 11 . 2 ) 


AGREEMENT 


SUBJECT 


NUMBER SG 


AGREEMENT NUMBER SG 


u 


SUBJECT 


AGREEMENT 


PERSON 

NUMBER 


3 

SG 


AGREEMENT 


NUMBER SG 


SUBJECT 


AGREEMENT 


NUMBER SG 
PERSON 3 


The values at the end of the (subject agreement) path and the 
(agreement) path are the same, but not shared, in the first argument. The 
unification of the subject features of the two arguments adds the person 
information from the second argument to the result. However, since there 
is no index linking the agreement feature to the (subject agreement) 
feature, this information is not added to the value of the agreement fea¬ 
ture. 

Finally, consider the following example of a failure to unify. 


AGREEMENT 


CD 


NUMBER 

PERSON 


SG 

3 


SUBJECT 


AGREEMENT 0 


u 


AGREEMENT 


NUMBER SG 
PERSON 3 


SUBJECT 


AGREEMENT 


NUMBER PL 
PERSON 3 


Fails! 


Proceeding through the features in order, we first find that the agreement 
features in these examples successfully match. However, when we move 
on to the subject features, we find that the values found at the end of the 
respective ( subject agreement number ) paths differ, causing a unifi¬ 
cation failure. 
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Feature structures are a way of representing partial information about 
some linguistic object or placing informational constraints on what the object 
can be. Unification can be seen as a way of merging the information in each 
feature structure, or describing objects which satisfy both sets of constraints. 
Intuitively, unifying two feature structures produces a new feature structure 
which is more specific (has more information) than, or is identical to, either 
of the input feature structures. We say that a less specific (more abstract) 
subsumes feature structure subsumes an equally or more specific one. Subsumption 
is represented by the operator □. A feature structure F subsumes a feature 
structure G (F □ G) if and only if: 


1. for every feature x in F, F(x) U G(x) . (where F(x) means ‘the value 
of the feature x of feature structure F’) 

2. for all paths p and q in F such that F(p) =F(q ), it is also the case that 
G(p) = G(q). 


For example, consider these feature structures: 


(11.3) 

NUMBER 

SG 

(11.4) 

PERSON 

3 


(11.5) 

NUMBER 

SG 


PERSON 

3 


( 11 . 6 ) 


CAT VP 

AGREEMENT CD 


SUBJECT 


AGREEMENT Q] 


(11.7) 


CAT VP 

AGREEMENT Q] 


SUBJECT 


AGREEMENT 


PERSON 3 
NUMBER SG 


( 11 . 8 ) 


CAT VP 

AGREEMENT CD 


SUBJECT 


AGREEMENT 


CD 


PERSON 

NUMBER 


3 

SG 
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The following subsumption relations hold among them: 

11.3 □ 11.5 

11.4 □ 11.5 
11.6 □ 11.7 □ 11.8 


Subsumption is a partial ordering; there are pairs of feature structures 
that neither subsume nor are subsumed by each other: 

11.3 £ 11.4 

11.4 % 11.3 

Since every feature structure is subsumed by the empty structure [], 
the relation among feature structures can be defined as a semilattice. The 
semilattice is often represented pictorially with the most general feature [] 
at the top and the subsumption relation represented by lines between feature 
structures. Unification can be defined in terms of the subsumption semilat¬ 
tice. Given two feature structures F and G, F U G is defined as the most 
general feature structure H s.t. F □ H and G C //. Since the information 
ordering defined by unification is a semilattice, the unification operation is 
monotonic (Pereira and Shieber, 1984; Rounds and Kasper, 1986; Moshier, 
1988). This means that if some description is true of a feature structure, 
unifying it with another feature structure results in a feature structure that 
still satisfies the original description. The unification operation is therefore 
order-independent; given a set of feature structures to unify, we can check 
them in any order and get the same result. Thus in the above example we 
could instead have chosen to check the agreement attribute first and the 
unification still would have failed. 

To summarize, unification is a way of implementing the integration of 
knowledge from different constraints. Given two compatible feature struc¬ 
tures as input, it produces the most general feature structure which nonethe¬ 
less contains all the information in the inputs. Given two incompatible fea¬ 
ture structures, it fails. 

11.3 Features Structures in the Grammar 

Our primary purpose in introducing feature structures and unification has 
been to provide a way to elegantly express syntactic constraints that would 


SEMILATTICE 


MONOTONIC 
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be difficult to express using the mechanisms of context-free grammars alone. 
Our next step, therefore, is to specify a way to integrate feature structures 
and unification operations into the specification of a grammar. This can be 
accomplished by augmenting the rules of ordinary context-free grammars 
with attachments that specify feature structures for the constituents of the 
rules, along with appropriate unification operations that express constraints 
on those constituents. From a grammatical point of view, these attachments 
will be used to accomplish the following goals: 

• To associate complex feature structures with both lexical items and 
instances of grammatical categories. 

• To guide the composition of feature structures for larger grammatical 
constituents based on the feature structures of their component parts. 

• To enforce compatibility constraints between specified parts of gram¬ 
matical constructions. 

We will use the following notation to denote the grammar augmenta¬ 
tions that will allow us to accomplish all of these goals, based on the PATR-II 
system described in Shieber (1986): 

Po — ^ Pi * * * Pra 

{set of constraints} 

The specified constraints have one of the following forms. 

(Pi feature path) = Atomic value 
(Pi feature path) = (P j feature path) 

The notation (Pi feature path) denotes a feature path through the feature 
structure associated with the Pi component of the context-free part of the 
rule. The first style of constraint specifies that the value found at the end 
of the given path must unify with the specified atomic value. The second 
form specifies that the values found at the end of the two given paths must 
be unifiable. 

To illustrate the use of these constraints, let us return to the informal 
solution to the number agreement problem proposed at the beginning of this 
chapter. 

S -> NP VP 

Only if the number of the NP is equal to the number of the VP. 
Using the new notation, this rule can now be expressed as follows. 

S —> NP VP 

{NP number) = {VP number) 
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Note that in cases where there are two or more constituents of the same 
syntactic category in a rule, we will subscript the constituents to keep them 
straight, as in VP —> VNP\ NP 2 

Taking a step back from the notation, it is important to note that in 
this approach the simple generative nature of context-free rules has been 
fundamentally changed by this augmentation. Ordinary context-free rules 
are based on the simple notion of concatenation; an NP followed by a VP 
is an 5, or generatively, to produce an S all we need to do is concatenate an 
NP to a VP. In the new scheme, this concatenation must be accompanied by 
a successful unification operation. This leads naturally to questions about 
the computational complexity of the unification operation and its effect on 
the generative power of this new grammar. These issues will be discussed in 
detail in Chapter 13. 

To review, there are two fundamental components to this approach. 

• The elements of context-free grammar rules will have feature-based 
constraints associated with them. This reflects a shift from atomic 
grammatical categories to more complex categories with properties. 

• The constraints associated with individual rules can refer to, and ma¬ 
nipulate, the feature structures associated with the parts of the rule to 
which they are attached. 

The following sections present applications of unification constraints 
to four interesting linguistic phenomena: agreement, grammatical heads, 
subcategorization, and long distance dependencies. 

Agreement 

As discussed in Chapter 9, agreement phenomena show up in a number 
of different places in English. This section illustrates how unification can 
be used to capture the two main types of English agreement phenomena: 
subject-verb agreement and determiner-nominal agreement. We will use the 
following ATIS sentences as examples throughout this discussion to illus¬ 
trate these phenomena. 

(11.9) This flight serves breakfast. 

(11.10) Does this flight serve breakfast? 

(11.11) Do these flights serve breakfast? 

Notice that the constraint used to enforce subject-verb agreement 
given above is deficient in that it ignores the person feature. The following 
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constraint which makes use of the agreement feature takes care of this 
problem. 

S NP VP 

( NP agreement) = {VP agreement) 

Examples 11.10 and 11.11 illustrate a minor variation on subject- 
verb agreement. In these Yes-No questions, the subject NP must agree 
with the auxiliary verb, rather than the main verb of the sentence, which 
appears in a non-finite form. This agreement constraint can be handled by 
the following rule. 

S AuxNP VP 

{Aux agreement) = {NP agreement) 

Agreement between determiners and nominals in noun phrases is han¬ 
dled in a similar fashion. The basic task is to allow the forms given above, 
but block the unwanted Hhis flights and Hhose flight forms where the de¬ 
terminers and nominals clash in their number feature. Again, the logical 
place to enforce this constraint is in the grammar rule that brings the parts 
together. 

NP —> Det Nominal 

{Det agreement) = {Nominal agreement) 

{NP agreement) = {Nominal agreement) 

This rule states that the agreement feature of the Det must unify with 
the agreement feature of the Nominal , and moreover, that the agree¬ 
ment feature of the NP is constrained to be the same as that of the Nominal. 

Having expressed the constraints needed to enforce subject-verb and 
determiner-nominal agreement, we must now fill in the rest of the machinery 
needed to make these constraints work. Specifically, we must consider how 
the various constituents that take part in these constraints (the Aux , VP. NP , 
Det , and Nominal ) acquire values for their various agreement features. 

We can begin by noting that our constraints involve both lexical and 
non-lexical constituents. The simpler lexical constituents, Aux and Det , re¬ 
ceive values for their respective agreement features directly from the lexicon 
as in the following rules. 

Aux —> do 

{AUX AGREEMENT NUMBER) = PL 

{Aux AGREEMENT PERSON) = 3 
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Aux —> does 

(S AGREEMENT NUMBER) = SG 
(S AGREEMENT PERSON) = 3 

Determiner —y this 

(Determiner agreement number) = sg 

Determiner —> these 

(Determiner agreement number) = pl 

Returning to our first S rule, let us first consider the agreement fea¬ 
ture for the VP constituent. The constituent structure for this VP is specified 
by the following rule. 

VP Verb NP 

It seems clear that the agreement constraint for this constituent must 
be based on its constituent verb. This verb, as with the previous lexical 
entries, can acquire its agreement feature values directly from lexicon as in 
the following rules. 

Verb serve 

(Verb AGREEMENT NUMBER) = PL 
Verb serves 

(Verb AGREEMENT NUMBER) = SG 
(Verb AGREEMENT PERSON) = 3 

All that remains is to stipulate that the agreement feature of the parent VTis 
constrained to be the same as its verb constituent. 

VP Verb NP 

(VP agreement) = (Verb agreement) 

In other words, non-lexical grammatical constituents can acquire values for 
at least some of their features from their component constituents. 

The same technique works for the remaining NP and Nominal cate¬ 
gories. The values for the agreement features for these categories are derived 
from the nouns flight and flights. 

Noun flight 

(Noun AGREEMENT NUMBER) = SG 
Noun —> flights 

(Noun AGREEMENT NUMBER) = PL 
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Similarly, the Nominal features are constrained to have the same values 
as its constituent noun, as follows. 

Nominal —> Noun 

(Nominal agreement) = {Noun agreement) 

Note that this section has only scratched the surface of the English 
agreement system, and that the agreement system of other languages can be 
considerably more complex than English. 

Head Features 

To account for the way compositional grammatical constituents such as noun 
phrases, nominals, and verb phrases come to have agreement features, the 
preceding section introduced the notion of copying needed feature structures 
from children to their parents. This use turns out to be a specific instance 
of a much more general phenomenon in constraint-based grammars. Specif¬ 
ically, the features for most grammatical categories are copied from one of 
the children to the parent. The child that provides the features is called the 
phras°e fthe head of the phrase, and the features copied are referred to as head features. 
features To make this clear, consider the following three rules from the last 

section. 

VP -> Verb NP 

{VP agreement) = {Verb agreement) 

NP —> Det Nominal 

{Det agreement) = {Nominal agreement) 

{NP agreement) = {Nominal agreement) 

Nominal —> Noun 

{Nominal agreement) = {Noun agreement) 

In each of these rules, the constituent providing the agreement feature 
structure up to the parent is the head of the phrase. More specifically, the 
verb is the head of the verb phrase, the nominal is the head of the noun 
phrase, and the noun is the head of the nominal. In addition, we can say that 
the agreement feature structure is a head feature. We can rewrite our rules to 
reflect these generalizations by placing the agreement feature structure under 
a head feature and then copying that feature upward as in the following 
constraints. 



Section 11.3. 


Features Structures in the Grammar 


407 


VP Verb NP (11.12) 

(VP head) = (Verb head) 

NP —> Det Nominal (11.13) 

(NP head) = (Nominal head) 

(Det head agreement) = (Nominal head agreement) 

Nominal Noun (11.14) 

(Nominal head) = (Noun head) 

Similarly, the lexical rules that introduce these features must now re¬ 
flect this HEAD notion, as in the following. 

Noun flights 

(Noun HEAD AGREEMENT NUMBER) = PL 
Verb —> serves 

(Verb HEAD AGREEMENT NUMBER) = SG 
(Verb HEAD AGREEMENT PERSON) = 3 

The notion of a head is an extremely significant one in grammar, be¬ 
cause it provides a way for a syntactic rule to be linked to a particular word. 
In this way heads will play an important role in the dependency grammars 
and lexicalized grammars of Chapter 12, and the head transducers men¬ 
tioned in Chapter 21. 

Subcategorization 

Recall that subcategorization is the notion that verbs can be picky about the 
patterns of arguments they will allow themselves to appear with. In Chap¬ 
ter 9, to prevent the generation of ungrammatical sentences with verbs and 
verb phrases that do not match, we were forced to split the category of verb 
into multiple sub-categories. These more specific verbs were then used in 
the definition of the specific verb phrases that they were allowed to occur 
with, as in the following rule. 

Verb-with-S-comp —> think 
VP —> Verb-with-S-comp S 

Clearly, this approach introduces exactly the same undesirable prolif¬ 
eration of categories that we saw with the similar approach to solving the 
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number problem. The proper way to avoid this proliferation is to introduce 
feature structures to distinguish among the various members of the verb cate¬ 
gory. This goal can be accomplished by associating an atomic feature called 
subcat, with an appropriate value, with each of the verbs in the lexicon. 
For example, the transitive version of serves could be assigned the following 
feature structure in the lexicon. 

Verb serves 

(Verb HEAD AGREEMENT NUMBER) = SG 
(Verb HEAD SUBCAT) = TRANS 

The subcat feature is a signal to the rest of the grammar that this verb 
should only appear in verb phrases with a single noun phrase argument. This 
constraint is enforced by adding corresponding constraints to all the verb 
phrase rules in the grammar, as in the following. 

VP Verb 

(VP head) = (Verb head) 

{VP HEAD SUB CAT) = INTRANS 

VP -> Verb NP 

(VP head) = (Verb head) 

(VP HEAD SUBCAT) = TRANS 

VP -> Verb NP NP 

(VP head) = (Verb head) 

{VP HEAD SUB CAT) = DITRANS 

The first unification constraint in these rules states that the verb phrase 
receives its head features from its verb constituent, while the second con¬ 
straint specifies what the value of that sub cat feature must be. Any attempt 
to use a verb with an inappropriate verb phrase will fail since the value of the 
SUBCAT feature of the VP will fail to unify with the atomic symbol given in 
second constraint. Note this approach requires unique symbols for each of 
the 50 to 100 verb phrase frames in English. 

This is a somewhat opaque approach since these unanalyzable subcat 
symbols do not directly encode either the number or type of the arguments 
that the verb expects to take. To see this, note that one can not simply exam¬ 
ine a verb’s entry in the lexicon and know what its subcategorization frame 
is. Rather, you must use the value of the SUB cat feature indirectly as a 
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pointer to those verb phrase rules in the grammar that can accept the verb in 
question. 

A somewhat more elegant solution, which makes better use of the ex¬ 
pressive power of feature structures, allows the verb entries to directly spec¬ 
ify the order and category type of the arguments they require. The following 
entry for serves is an example of one such approach, in which the verb’s 
subcategory feature expresses a list of its objects and complements. 

Verb serves 

(Verb HEAD AGREEMENT NUMBER) = SG 
(Verb HEAD SUBCAT FIRST CAT) = NP 
(Verb HEAD SUBCAT SECOND) = END 

This entry uses the first feature to state that the first post-verbal ar¬ 
gument must be an NP; the value of the second feature indicates that this 
verb expects only one argument. A verb like leave Boston in the morning , 
with two arguments, would have the following kind of entry. 

Verb^f leaves 

(Verb HEAD AGREEMENT NUMBER) = SG 
(Verb HEAD SUBCAT FIRST CAT) = NP 
(Verb HEAD SUBCAT SECOND CAT) = PP 
(Verb HEAD SUBCAT THIRD) = END 

This scheme is, of course, a rather baroque way of encoding a list; it is 
also possible to use the idea of types defined in Section 11.6 to define a list 
type more cleanly. 

The individual verb phrase rules must now check for the presence of 
exactly the elements specified by their verb, as in the following transitive 
rule. 


VP Verb NP (11.15) 

{VP head) = (Verb head) 

(VP HEAD SUB CAT FIRST CAT) = (NP CAT ) 

(VP HEAD SUBCAT SECOND) = END 

The second constraint in this rule’s constraints states that the category 
of the first element of the verb’s sub cat list must match the category of the 
constituent immediately following the verb. The third constraint goes on to 
state that this verb phrase rule expects only a single argument. 
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Our previous examples have shown rather simple subcategorization 
structures for verbs. In fact, verbs can subcategorize for quite complex sub- 

SUBCATEGO 

rization categorization frames, (e.g. NP PP, NP NP, or NP S ) and these frames can 
be composed of many different phrasal types. In order to come up with a list 
of possible subcategorization frames for English verbs, we first need to have 
a list of possible phrase types that can make up these frames. Figure 11.3 
shows one short list of possible phrase types for making up subcategorization 
frames for verbs; this list is modified from one used to create verb subcate¬ 
gorization frames in the FrameNet project (Baker el al , 1998), and includes 
phrase types for the subjects of verbs there, it, as well as objects and com¬ 
plements. 

To use the phrase types in Figure 11.3 in a unification grammar, each 
phrase type would have to be described using features. For example the form 
VPto which is subcategorized for by want might be expressed as: 

Verb —> want 

{Verb HEAD SUBCAT FIRST CAT) = YP 
(Verb HEAD SUBCAT FIRST FORM) = INFINITIVE 

Each of the 50 to 100 possible verb subcategorization frames in English 
would be described as a set drawn from these phrase types. For example, 
here’s an example of the two-complement want. We’ve used this following 
example to demonstrate two different notational possibilities. First, lists can 
be represented via an angle brackets notation ( and). Second, instead of us¬ 
ing a rewrite-rule annotated with path equations, we can represent the lexical 
entry as a single feature structure: 


ORTH WANT 
CAT VERB 


HEAD 


SUBCAT 


<’ 


CAT NP 


CAT VP 
HEAD [ 


VFORM INFINITIVE 


) 


Combining even a limited set of phrase types results in a very large set 
of possible subcategorization frames. Furthermore, each verb allows many 
different subcategorization frames. For example, here are just some of the 
subcategorization patterns for the verb ask, with examples from the BNC: 
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Noun Phrase Types 

There 

nonreferential there 

There is still much to learn.. . 

It 

nonreferential it 

It was evident that my ideas... 

NP 

noun phrase 

As he was relating his story... 

Preposition Phrase Types 

PP 

preposition phrase 

couch their message in terms... 

PPing 

gerundive PP 

censured him for not having intervened... 

PPpart 

Particle 

turn it off 

Verb Phrase Types 

VPbrst 

bare stem VP 

she could discuss it 

VPto 

to-marked infin. VP 

Why do you want to know? 

VPwh 

Wh- VP 

it is worth considering how to write 

VPing 

gerundive VP 

I would consider using it 

Complement Clause types 

Finite Clause 

Sfin 

finite clause 

maintain that the situation was unsatisfactory 

Swh- 

Wh- clause 

...it tells us where we are... 

Swheth 

Whether/if clause 

ask whether Aristophanes is depicting a... 

Nonfinite Clause 

Sing 

gerundive clause 

...see some attention being given... 

Sto 

to-marked clause 

know themselves to be relatively unhealthy 

Sforto 

for-to clause 

She was waiting for him to make some reply... 

Sbrst 

bare stem clause 

commanded that his sermons be published 

Other Types 

AjP 

adjective phrase 

thought it possible 

Quo 

quotes 

asked “What was it like?” 


Figure 11.3 A small set of potential phrase types which can be combined 
to create a set of potential subcategorization frames for verbs. Modified from 
the FrameNet tagset (Baker et al ., 1998). The sample sentence fragments are 
from the British National Corpus. 


Subcat 

Example 

Quo 

asked [ Quo “What was it like?”] 

NP 

asking [;vp a question] 

Swh 

asked [$ w h what trades you’re interested in] 

Sto 

ask [g to him to tell you] 

PP 

that means asking [pp at home] 

Vto 

asked [y to to see a girl called Evelyn] 

NP Swheth 

asked Lvp him] [gwheth whether he could make] 

NP NP 

asked [yjp myself] p a question] 

NP Swh 

asked [yyp him] [g w ^ why he took time off] 
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A number of comprehensive subcategorization-frame tagsets exist, such 
as the COMLEX set (Macleod et al , 1998), which includes subcategoriza¬ 
tion frames for verbs, adjectives, and nouns, and the ACQUILEX tagset of 
verb subcategorization frames (Sanfilippo, 1993). Many subcategorization- 
frame tagsets add other information about the complements, such as spec¬ 
ifying the identity of the subject in a lower verb phrase that has no overt 
control subject; this is called control information. For example Temmy promised 
Ruth to go (at least in some dialects) implies that Temmy will do the go¬ 
ing, while Temmy persuaded Ruth to go implies that Ruth will do the going. 
Some of the multiple possible subcategorization frames for a verb can be 
partially predicted by the semantics of the verb; for example many verbs of 
transfer (like give, send , carry) predictably take the two subcategorization 
frames NP NP and NP PP: 

NP NP sent FAA Administrator James Busey a letter 

NP PP sent a letter to the chairman of the Armed Services Committee 

These relationships between subcategorization frames across classes 
“ of verbs are called argument-structure alternations, and will be discussed 
in Chapter 16 when we discuss the semantics of verbal argument structure. 
Chapter 12 will introduce probabilities for modeling the fact that verbs gen¬ 
erally have a bias toward which of their possible they prefer. 

Subcategorization in Other Parts of Speech 

valence Although the notion of subcategorization, or valence as it is often called, was 
originally designed for verbs, more recent work has focused on the fact that 
many other kinds of words exhibit forms of valence-like behavior. Consider 
the following contrasting uses of the prepositions while and during. 

(11.16) Keep your seatbelt fastened while we are taking off. 

(11.17) *Keep your seatbelt fastened *while takeoff. 

(11.18) Keep your seatbelt fastened during takeoff. 

(11.19) *Keep your seatbelt fastened during we are taking off. 

Despite the apparent similarities between these words, they make quite dif¬ 
ferent demands on their arguments. Representing these differences is left as 
Exercise 11.5 for the reader. 

Many adjectives and nouns also have subcategorization frames. Here 
are some examples using the adjectives apparent , aware , and unimportant 
and the nouns assumption and question : 
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It was apparent [sfin that the kitchen was the only room... ] 

It was apparent [ppfrom the way she rested her hand over his] 
aware [sfin he may have caused offense] 
it is unimportant [swheth whether only a little bit is accepted] 
the assumption \sj m that wasteful methods have been employed] 
the question [swheth whether the authorities might have decided] 

See Macleod et al (1998) for a description of subcategorization frames 
for nouns and adjectives. 

Verbs express subcategorization constraints on their subjects as well as 
their complements. For example, we need to represent the lexical fact that 
the verb seem can take a Sfin as its subject (That she was affected seems 
obvious ), while the verb paint cannot. The subject feature can be used to 
express these constraints. 

Long Distance Dependencies 

The model of subcategorization we have developed so far has two compo¬ 
nents. Each head word has a subcat feature which contains a list of the 
complements it expects. Then phrasal rules like the VP rule in (11.16) match 
up each expected complement in the sub cat list with an actual constituent. 

This mechanism works fine when the complements of a verb are in fact to 
be found in the verb phrase. 

Sometimes, however, a constituent subcategorized for by the verb is 
not locally instantiated, but is in a long-distance relationship with the pred- lonq 
icate. Here are some examples of such long-distance dependencies: depb 

CIES 

What cities does Continental service? 

What flights do you have from Boston to Baltimore? 

What time does that flight leave Atlanta? 

In the first example, the constituent what cities is subcategorized for 
by the verb service , but because the sentence is an example of a wh-non- 
subject-question, the object is located at the front of the sentence. Recall 
from Chapter 9 that a (simple) phrase-structure rule for a wh-non-subject- 
question is something like the following: 

S Wh-NP Awe NP VP 

Now that we have features, we’ll be able to augment this phrase-structure 
rule to require the Aux and the NP to agree (since the NP is the subject). 

But we also need some way to augment the rule to tell it that the Wh-NP 
should fill some subcategorization slot in the VP. The representation of such 
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GAP LIST 

FILLER 

11.4 


long-distance dependencies is a quite difficult problem, because the verb 
whose subcategorization requirement is being filled can be quite distant from 
the filler. In the following (made-up) sentence, for example, the w/z-phrase 
which flight must fill the subcategorization requirements of the verb book, 
despite the fact that there are two other verbs (want and have) in between: 

Which flight do you want me to have the travel agent book? 

Many solutions to representing long-distance dependencies in unifica¬ 
tion grammars involve keeping a list, often called a gap list, implemented 
as a feature gap, which is passed up from phrase to phrase in the parse 
tree. The filler (for example which flight above) is put on the gap list, and 
must eventually be unified with the subcategorization frame of some verb. 
See Sag and Wasow (1999) for an explanation of such a strategy, together 
with a discussion of the many other complications that must be modeled in 
long-distance dependencies. 


Implementing Unification 

As discussed, the unification operator takes two feature structures as input 
and returns a single merged feature structure if successful, or a failure sig¬ 
nal if the two inputs are not compatible. The input feature structures are 
represented as directed acyclic graphs (DAGs), where features are depicted 
as labels on directed edges, and feature values are either atomic symbols or 
DAGs. As we will see, the implementation of the operator is a relatively 
straightforward recursive graph matching algorithm, suitably tailored to ac¬ 
commodate the various requirements of unification. Roughly speaking, the 
algorithm loops through the features in one input and attempts to find a corre¬ 
sponding feature in the other. If all of the features match, then the unification 
is successful. If any single feature causes a mismatch then the unification 
fails. Not surprisingly, the recursion is motivated by the need to correctly 
match those features that have feature structures as their values. 

One somewhat unusual aspect of the algorithm is that rather than con¬ 
struct a new output feature structure with the unified information from all the 
information from the two arguments, it destructively alters the arguments so 
that in the end they point to exactly the same information. Thus the result 
of a successful call to the unification operator consists of suitably altered 
versions of the arguments (failed unifications also result in alterations to the 
arguments, but more on that later in Section 11.5.) As is discussed in the 
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next section, the destructive nature of this algorithm necessitates certain mi¬ 
nor extensions to the simple graph version of feature structures as DAGs we 
have been assuming. 


Unification Data Structures 


To facilitate the destructive merger aspect of the algorithm, we add a small 
complication to the DAGs used to represent the input feature structures; fea¬ 
ture structures are represented using DAGs with additional edges, or fields. 
Specifically, each feature structure consists of two fields: a content field and 
a pointer field. The content field may be null or contain an ordinary fea¬ 
ture structure. Similarly, the pointer field may be null or contain a pointer to 
another feature structure. If the pointer field of the DAG is null, then the con¬ 
tent field of the DAG contains the actual feature structure to be processed. 
If, on the other hand, the pointer field is non-null, then the destination of the 
pointer represents the actual feature structure to be processed. Not surpris¬ 
ingly, the merger aspects of unification are achieved by altering the pointer 
field of DAGs during processing. 

To make this scheme somewhat more concrete, consider the extended 
DAG representation for the following familiar feature structure. 


( 11 . 20 ) 


NUMBER 

PERSON 


SG 

3 


The extended DAG representation is illustrated with our textual matrix dia¬ 
grams by treating the content and pointer fields as ordinary features, as 
in the following matrix. 


( 11 . 21 ) 


CONTENT 


POINTER 



CONTENTS 

SG 


NUMBER 

POINTER 

NULL 



CONTENTS 

3 


PERSON 





POINTER 

NULL 



NULL 


Figure 11.4 shows this extended representation in its graphical form. 
Note that the extended representation contains content and pointer links both 
for the top-level layer of features, as well as for each of the embedded feature 
structures all the way down to the atomic values. 

Before going on to the details of the unification algorithm, we will 
illustrate the use of this extended DAG representation with the following 
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Figure 11.4 An extended DAG notation for Examples 11.20 and 11.21 . 


simple example. The original extended representation of the arguments to 
this unification are shown in Figure 11.5. 


( 11 . 22 ) 


U f 


NUMBER SG U PERSON 3 


NUMBER SG 
PERSON 3 
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At a high level, we would simply say that the unification results in the 
creation of a new structure containing the union of the information from the 
two original arguments. With the extended notation, we can see how the uni¬ 
fication is accomplished by making some additions to the original arguments 
and changing some of the pointers from one structure to the other so that in 
the end they contain the same content. In this example, this is accomplished 
by first adding a PERSON feature to the first argument, and assigning it a 
value by filling its pointer field with a pointer to the appropriate location 
in the second argument, as shown in Figure 11.6. 
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The process is, however, not yet complete. While it is clear from Figure 
11.6 that the first argument now contains all the correct information, the 
second one does not; it lacks a number feature. We could, of course, add 
a number feature to this argument with a pointer to the appropriate place 
in the first one. This change would result in the two arguments having all 
the correct information from this unification. Unfortunately, this solution is 
inadequate since it does not meet our requirement that the two arguments be 
truly unified. Since the two arguments are not completely unified at the top 
level, future unifications involving one of the arguments would not show up 
in the other. The solution to this problem is to simply set the pointer field 
of the second argument to point at the first one. When this is done any future 
change to either argument will be immediately reflected in both. The result 
of this final change is shown in Figure 11.7. 
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function Unify(/7,/2) returns fstructure or failure 

f 1-real Real contents of/7 
f2-real<- Real contents of/2 

if/ 1-real is null then 
fl.pointer <-f2 

return/2 

else if f2-real is null then 
f2.pointer <-fl 

return fl 

else if fl-real mdf2-real are identical then 
f 1.pointer 4—f2 

return/2 

else if both fl-real mdf2-real are complex feature structures then 
/2.j pointer i-fl 
for each feature in f2 -real do 
other-feature <- Find or create 

a feature corresponding to feature in fl-real 
if Unify (feature.value, other-feature.value) returns failure then 
return failure 
return// 

else return failure 


Figure 11.8 The Unification Algorithm. 


The Unification Algorithm 

The unification algorithm that we have been leading up to is shown in Figure 
11.8. To review, this algorithm accepts two feature structures represented 
using the extended DAG representation. As can be seen from the code, it 
may return as its return either one of these arguments. This is, however, 
somewhat deceptive since the true effect of this algorithm is the destructive 
unification of the two inputs. 

The first step in this algorithm is to acquire the true contents of both of 
the arguments. Recall that if the pointer field of an extended feature structure 
is non-null, then the real content of that structure is found by following the 
pointer found in pointer field. The variables f 1-real and f2-real are the result 
of this pointer following process, which is often referred to as dereferenc¬ 
ing. 
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As with all recursive algorithms, the next step is to test for the various 
base cases of the recursion before proceeding on to a recursive call involving 
some part of the original arguments. In this case, there are three possible base 
cases: 

• One or both of the arguments has a null value. 

• The arguments are identical. 

• The arguments are non-complex and non-identical. 

In the case where either of the arguments is null, the pointer field for 
the null argument is changed to point to the other argument, which is then 
returned. The result is that both structures now point at the same value. 

If the structures are identical, then the pointer of the first is set to the 
second and the second is returned. It is important to understand why this 
pointer change is done in this case. After all, since the arguments are iden¬ 
tical, returning either one would appear to suffice. This might be true for a 
single unification but recall that we want the two arguments to the unification 
operator to be truly unified. The pointer change is necessary since we want 
the arguments to be truly identical, so that any subsequent unification that 
adds information to one will add it to both. 

If neither of the preceding tests is true then there are two possibili¬ 
ties: they are non-identical atomic values, or they are non-identical complex 
structures. The former case signals an incompatibility in the arguments that 
leads the algorithm to return a failure signal. In the latter case, a recursive 
call is needed to ensure that the component parts of these complex structures 
are compatible. In this implementation, the key to the recursion is a loop 
over all the features of the second argument,/?. This loop attempts to unify 
the value of each feature in J2 with the corresponding feature in fl . In this 
loop, if a feature is encountered in J2 that is missing from//, a feature is 
added to fl and given the value null. Processing then continues as if the 
feature had been there to begin with. If every one of these unifications suc¬ 
ceeds, then the pointer field of J2 is set to fl completing the unification of the 
structures and fl is returned as the value of the unification. 

We should note that an unfortunate aspect of this algorithm is that it 
is capable of producing feature structures containing cycles. This situation 
can arise when the algorithm is asked to unify a structure with a second 
structure that contains the first as a subpart. The way to avoid this situation 
is to employ what is called an occur check (Robinson, 1965). This check 
analyzes the input DAGs and returns failure when one of the arguments is 
contained as a subpart of the other. In practice, this check is omitted from 
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most implementations due to its computational cost. 


An Example 


To illustrate this algorithm, let us walk through the following example. 


(11.23) 


AGREEMENT 

SUBJECT 


m 


NUMBER 


AGREEMENT 


sgJ 

CD 


I_ 


-■ _i 

1 

SUBJECT 

AGREEMENT 

PERSON 3 



Figure 11.9 shows the extended representations for the arguments to 
this unification. Note how the reentrant structure in the first argument is 
captured through the use of the ptr field. 

These original arguments are neither identical, nor null, nor atomic, so 
the main loop is entered. Looping over the features of fl , the algorithm is 
led to a recursive attempt to unify the values of the corresponding subject 
features of/7 and J2. 
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AGREEMENT Ej U 

These arguments are also non-identical, non-null, and non-atomic so 
the loop is entered again leading to a recursive check of the values of the 
AGREEMENT features. 


AGREEMENT 


PERSON 3 


NUMBER 


SG 


U 


PERSON 


3 


In looping over the features of the second argument, the fact that the 
first argument lacks a PERSON feature is discovered. A PERSON feature 
initialized with a null value is, therefore, added to the first argument. This, 
in effect, changes the previous unification to the following. 


NUMBER 

PERSON 


SG 

NULL 


U 


PERSON 


3 


After creating this new person feature, the next recursive call leads 
to the unification of the null value of the new feature in the first argument 
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with the 3 value of the second argument. This recursive call results in the 
assignment of the pointer field of the first argument to the 3 value in f2, as 
shown in 11.10. 

Since there are no further features to check in the f2 argument at any 
level of recursion, each in turn sets the pointer for its J2 argument to point at 
its fl argument and returns it. The result of all these assignments is shown 
in Figure 11.11. 


11.5 Parsing with Unification Constraints 

We now have all the pieces necessary to the integrate feature structures and 
unification into a parser. Fortunately, the order-independent nature of unifi¬ 
cation allows us to largely ignore the actual search strategy used in the parser. 
Once we have unification constraints associated with the context-free rules 
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of the grammar, and feature structures with the states of the search, any of 
the standard search algorithms described in Chapter 10 can be used. 

Of course, this leaves a fairly large range of possible implementation 
strategies. We could, for example, simply parse as we did before using the 
context-free components of the rules, and then build the feature structures 
for the resulting trees after the fact, filtering out those parses that contain 
unification failures. Although such an approach would result in only well- 
formed structures in the end, it fails to use the power of unification to reduce 
the size of the parser’s search space during parsing. 

The next section describes an approach that makes better use of the 
power of unification by integrating unification constraints directly into the 
Earley parsing process, allowing ill-formed structures to be eliminated as 
soon as they are proposed. As we will see, this approach requires only min¬ 
imal changes to the basic Earley algorithm. We then move on to briefly 
consider an approach to unification parsing that moves even further away 
from standard context-free methods. 

Integrating Unification into an Earley Parser 

We have two goals in integrating feature structures and unification into the 
Earley algorithm: to use feature structures to provide a richer representation 
for the constituents of the parse, and to block the entry into the chart of ill- 
formed constituents that violate unification constraints. As we will see, these 
goals can be accomplished via fairly minimal changes to the original Earley 
scheme given on page 378. 

The first change involves the various representations used in the orig¬ 
inal code. Recall that the Earley algorithm operates by using a set of un¬ 
adorned context-free grammar rules to fill in a data-structure called a chart 
with a set of states. At the end of the parse, the states that make up this chart 
represent all possible parses of the input. Therefore, we begin our changes 
by altering the representations of both the context-free grammar rules, and 
the states in the chart. 

The rules are altered so that in addition to their current components, 
they also include a feature structure derived from their unification constraints. 
More specifically, we will use the constraints listed with a rule to build a fea¬ 
ture structure, represented as a DAG, for use with that rule during parsing. 

Consider the following context-free rule with unification constraints. 
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S —► NP VP 

(NP HEAD AGREEMENT) = (VP HEAD AGREEMENT) 

(S HEAD) = (VP HEAD) 

Converting these constraints into a feature structure results in the following 
structure: 


s 

HEAD 

m 


NP 

HEAD 

AGREEMENT E 

VP 

HEAD 

E 

AGREEMENT E 


In this derivation, we combined the various constraints into a single structure 
by first creating top-level features for each of the parts of the context-free 
rule, s, np, and vp in this case. We then add further components to this 
structure by following the path equations in the constraints. Note that this 
is a purely notational conversion; the DAGs and the constraint equations 
contain the same information. However, tying the constraints together in a 
single feature structure puts it in a form that can be passed directly to our 
unification algorithm. 

The second change involves the states used to represent partial parses 
in the Earley chart. The original states contain fields for the context-free rule 
being used, the position of the dot representing how much of the rule has 
been completed, the positions of the beginning and end of the state, and a list 
of other states that represent the completed sub-parts of the state. To this set 
of fields, we simply add an additional field to contain the DAG representing 
the feature structure corresponding to the state. Note that when a rule is 
first used by Predictor to create a state, the DAG associated with the state 
will simply consist of the DAG retrieved from the rule. For example, when 
Predictor uses the above S rule to enter a state into the chart, the DAG 
given above will be its initial DAG. WeTl denote states like this as follows, 
where Dag denotes the feature structure given above. 

S^.NPVP, [0,0],0,Dag 

Given these representational additions, we can move on to altering 
the algorithm itself. The most important change concerns the actions that 
take place when a new state is created via the extension of an existing state, 
which takes place in the Completer routine. Recall that Completer is 
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called when a completed constituent has been added to the chart. Its task 
is to attempt to find, and extend, existing states in the chart that are looking 
for constituents that are compatible with the newly completed constituent. 
Completer is, therefore, a function that creates new states by combining 
the information from two other states, and as such is a likely place to apply 
the unification operation. 

To be more specific, Completer adds a new state into the chart by 
finding an existing state whose • can be advanced by the newly completed 
state. A • can be advanced when the category of the constituent immediately 
following it matches the category of the newly completed constituent. To 
accommodate the use of feature structures, we can alter this scheme by uni¬ 
fying the feature structure associated with the newly completed state with the 
appropriate part of the feature structure being advanced. If this unification 
succeeds, then the DAG of the new state receives the unified structure and is 
entered into the chart, if it fails then no new state is entered into the chart. 
The appropriate alterations to Completer are showin in Figure 11.12. 

Consider this process in the context of parsing the phrase That flight, 
where the That has already been seen, as is captured by the following state. 


NP —> DeUNominal[ 0,1], [Sz)<*],Dagi 


Dagi 


NP 


DET 


NOMINAL 


HEAD \B 


HEAD 


AGREEMENT H] NUMBER SG 


HEAD Q] AGREEMENT \2\ 


Now consider the later situation where the parser has processed flight and 
has subsequently produced the following state. 

Nominal —> Noun •, [1,2], [Sjsj oun ]^Dag 2 
Dag 2 


NOMINAL 


NOUN 


HEAD [□ 


HEAD CD 


AGREEMENT 


NUMBER SG 


To advance the NP rule, the parser unifies the feature structure found under 
the nominal feature of Dag 2 , with the feature structure found under the 
NOMINAL feature of the NP' s Dag i. As in the original algorithm, a new state 
is created to represent the fact that an existing state has been advanced. This 
new state’s DAG is given the DAG that resulted from the above unification. 
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function EARLEY-PARSE(vv<9nis, grammar) returns chart 

Enqueue((y [0,0], dagy) 7 chart[0]) 

for zA-from 0 to Length(hw^) do 
for each state in chart[i ] do 
if Incomplete l(state) and 

NEXT-CATUta/e) is not a part of speech then 
Predictor 

elseif Incomplete?U^) and 

Next-CatU/(3^) is a part of speech then 
SCANNEROtate) 

else 

COMPLETER(^ate) 

end 

end 

return (chart) 

procedure Predictor((A a •B (3, [ij\ |, dagA)) 

for each (B —> y) in Grammar-Rules-For(7?, grammar) do 
Enqueue^ ->• «y, [ jj ], dag B \ chart \ j ]) 

end 

procedure Scanner((A a • £ (3, dag A )) 
if B c PARTS-OF-SPEECH(wc>r^7) then 

Enqueue((£ word[j\ , [7,7+1], dag B ), chart[j+1]) 

procedure Completer(( 7? y •, [/,fc], dags)) 
for each (A a • B (3, [ij\, dag A ) in chart\j ] do 
if new-dag^\]mFY-STATES(dagB,dag A ,B) ^ Fails! 
Enqueue((A -a- a£*3, [i,k\,new- dag\chart[k]) 

end 

procedure Unify- S tat ES(dag7, dag2, cat) 
dagl-cp +- C0PYDAG(7ag7) 
dag2-cp CopyDag (dag2) 

UNiFY(FOLLOW-PATH(ca/, dagl-cp ), FOLLOW-PATH(cat, dag2-cp)) 

procedure Enqueue (state, chart-entry) 

if state is not subsumed by a state in chart-entry then 
Push {state, chart-entry) 

end 


Figure 11.12 Modifications to the Earley algorithm to include unification. 
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The final change to the original algorithm concerns the check for states 
already contained in the chart. In the original algorithm, the Enqueue func¬ 
tion refused to enter into the chart any state that was identical to one already 
present in the chart. Where identical meant the same rule, with the same 
start and finish positions, and the same position of the •. It is this check 
that allows the algorithm to, among other things, avoid the infinite recursion 
problems associated with left-recursive rules. 

The problem, of course, is that our states are now more complex since 
they have complex feature structures associated with them. States that ap¬ 
peared identical under the original criteria might in fact now be different 
since their associated DAGs may differ. The obvious solution to this prob¬ 
lem is to simply extend the identity check to include the DAGs associated 
with the states, but it turns out that we can improve on this solution. 

The motivation for the improvement lies in the motivation for the iden¬ 
tity check. Its purpose is to prevent the wasteful addition of a state into the 
chart whose effect on the parse would be accomplished by an already exist¬ 
ing state. Put another way, we want to prevent the entry into the chart of 
any state that would duplicate the work that will eventually be done by other 
states. Of course, this will clearly be the case with identical states, but it 
turns out it is also the case for states in the chart that are more general than 
new states being considered. 

Consider the situation where the chart contains the following state, 
where the Dag places no constraints on the Det. 

NP —► •Det NP : [/, /],[], Dag 

Such a state simply says that it is expecting a Det at position /, and that any 
Det will do. 

Now consider the situation where the parser wants to insert a new state 
into the chart that is identical to this one, with the exception that its DAG 
restricts the Det to be singular. In this case, although the states in question 
are not identical, the addition of the new state to the chart would accomplish 
nothing and should therefore be prevented. 

To see this let’s consider all the cases. If the new state is added, then a 
subsequent singular Det will match both rules and advance both. Due to the 
unification of features, both will have DAGs indicating that their Dets are 
singular, with the net result being duplicate states in the chart. If on the other 
hand, a plural Det is encountered, the new state will reject it and not advance, 
while the old rule will advance, entering a single new state into the chart. On 
the other hand, if the new state is not placed in the chart, a subsequent plural 
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or singular Det will match the more general state and advance it, leading to 
the addition of one new state into the chart. Note that this leaves us in exactly 
the same situation as if the new state had been entered into the chart, with 
the exception that the duplication is avoided. In sum, nothing worthwhile is 
accomplished by entering into the chart a state that is more specific than a 
state already in the chart. 

Fortunately, the notion of subsumption described earlier gives us a for¬ 
mal way to talk about the generalization and specialization relations among 
feature structures. This suggests that the proper way to alter Enqueue is to 
check if a newly created state is subsumed by any existing states in the chart. 
If it is, then it will not be allowed into the chart. More specifically, a new 
state that is identical in terms of its rule, start and finish positions, subparts, 
and • position, to an existing state, will be not be entered into the chart if its 
DAG is subsumed by the DAG of an existing state (ie. if DagM □ Dag new ). 
The necessary change to the original Earley Enqueue procedure is shown 
in Figure 11.12. 

The Need for Copying 

The calls to CopyDag within the Unify-State procedure require some 
elaboration. Recall that one of the strengths of the Earley algorithm (and of 
the dynamic programming approach in general) is that once states have been 
entered into the chart they may be used again and again as part of different 
derivations, including ones that in the end do not lead to successful parses. 
This ability is the motivation for the fact that states already in the chart are 
not updated to reflect the progress of their •, but instead are copied are then 
updated, leaving the original states intact so that they can be used again in 
further derivations. 

The call to CopyDag in Unify-State is required to preserve this be¬ 
havior because of the destructive nature of our unification algorithm. If we 
simply unified the DAGS associated the existing states, those states would 
be altered by the unification, and hence would not be available in the same 
form for subsequent uses by the Completer function. Note that has nega¬ 
tive consequences regardless of whether the unification succeeds or fails, in 
either case the original states are altered. 

Let’s consider what would happen if the call to CopyDag was absent 
in the following example where an early unification attempt fails. 

(11.24) Show me morning flights. 

Let’s assume that our parser has the following entry for the ditransitive ver- 



430 


Chapter 11. Features and Unification 


sion of the verb show, as well as the following transitive and ditransitive verb 
phrase rules. 


Verb show 

{Verb HEAD SUBCAT FIRST CAT) = NP 
{Verb HEAD SUBCAT SECOND CAT) = NP 
{Verb HEAD SUBCAT THIRD) = END 

VP -> Verb NP 

{VP head) = {Verb head) 

{VP HEAD SUB CAT FIRST CAT) = {NP CAT ) 

{VP HEAD SUBCAT SECOND) = END 

VP Verb NP NP 

{VP head) = {Verb head) 

{VP HEAD SUB CAT FIRST CAT) = {NP\ CAT ) 

{VP HEAD SUB CAT SECOND CAT) = {NP 2 CAT ) 

{VP HEAD SUB CAT THIRD) = END 

When the word me is read, the state representing transitive verb phrase 
will be completed since its dot has moved to the end. Completer will, 
therefore, call Unify-States before attempting to enter this complete state 
into the chart. This will fail since the sub cat structures of these two rules 
can not be unified. This is, of course, exactly what we want since this version 
of show is ditransitive. Unfortunately, because of the destructive nature of 
our unification algorithm we have already altered the DAG attached to the 
state representing show , as well as the one attached to the VP thereby ruining 
them for use with the correct verb phrase rule later on. Thus, to make sure 
that states can be used again and again with multiple derivations, copies are 
made of the dags associated with states before attempting any unifications 
involving them. 

We should note that all of this copying can be quite expensive. As a 
result, a number of alternative techniques have been developed that attempt 
to minimize this cost (Pereira, 1985; Karttunen and Kay, 1985; Tomabechi, 
1991; Kogure, 1990). Kiefer et al (1999) describe a set of related techniques 
used to speed up a large unification-based parsing system. 
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Unification Parsing 

A more radical approach to using unification in parsing can be motivated 
by looking at an alternative way of denoting our augmented grammar rules. 
Consider the following S rule that we have been using throughout this chap¬ 
ter. 

S —► NP VP 

(NP HEAD AGREEMENT) = (VP HEAD AGREEMENT) 

(S HEAD) = (VP HEAD) 

An interesting way to alter the context-free part of this rule is to change 
the way the its grammatical categories are specified. In particular, we can 
place the categorical information about the parts of the rule inside the fea¬ 
ture structure, rather than inside the context-free part of the rule. A typi¬ 
cal instantiation of this approach would give us the following rule (Shieber, 
1986). 

*0 *1 *2 
(X 0 cat) = S 
{Xi cat) =NP 
(X 2 cat) = VP 

(Xi HEAD AGREEMENT) = (X 2 HEAD AGREEMENT) 

( X 0 HEAD) = (X 2 HEAD) 

Focusing solely on the context-free component of the rule, this rule 
now simply states that the X 0 constituent consists of two components, and 
that the the X\ constituent is immediately to the left of the X 2 constituent. 
The information about the actual categories of these components is placed 
inside the rule’s feature structure; in this case, indicating that Xq is an 5, X\ 
is an NP, and X 2 is a VP. Altering the Earley algorithm to deal with this 
notational change is trivial. Instead of seeking the categories of constituents 
in the context-free components of the rule, it simply needs to look at the cat 
feature in the DAG associated with a rule. 

Of course, since it is the case that these two rules contain precisely the 
same information, it isn’t clear that there is any benefit to this change. To 
see the potential benefit of this change, consider the following rules. 

*0 *1 *2 

(Xq cat) = { Xi cat) 

(X 2 cat) = PP 
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Ao — y X\ and X 2 
(X\ cat) = { X 2 cat) 

( X 0 cat) = { X\ cat) 

The first rule is an attempt to generalize over various rules that we have 
already seen, such as NP — y NP PP and VP — y VP PP. It simply states that 
any category can be followed by a prepositional phrase, and that the resulting 
constituent has the same category as the original. Similarly, the second rule 
is an attempt to generalize over rules such as S —y S and S, NP —y NP and NP , 
and so on. 1 It states that any constituent can be conjoined with a constituent 
of the same category to yield a new category of the same type. What these 
rules have in common is their use of context-free rules that contain con¬ 
stituents with constrained, but unspecified, categories, something that can 
not be accomplished with our old rule format. 

Of course, since these rules rely on the use the cat feature, their ef¬ 
fect could be approximated in the old format by simply enumerating all the 
various instantiations of the rule. A more compelling case for the new ap¬ 
proach is motivated by the existence of grammatical rules, or constructions, 
that contain constituents that are not easily characterized using any existing 
syntactic category. 

Consider the following examples of the English How-Many con¬ 
struction from the WSJ (Jurafsky, 1992). 

(11.25) How early does it open? 

(11.26) How deep is her Greenness? 

(11.27) How papery are your profits? 

(11.28) How quickly we forget. 

(11.29) How many of you can name three famous sporting Blanchards? 

As is illustrated in these examples, the How-Many construction has two 
components: the lexical item how , and a lexical item or phrase that is rather 
hard to characterize syntactically. It is this second element that is of interest 
to us here. As these examples show, it can be an adjective, adverb, or some 
kind of quantified phrase (although not all members of these categories yield 
grammatical results). Clearly, a better way to describe this second element 
is as a scalar concept, a constraint can captured using feature structures, as 
in the following rule. 

1 These rules should not be mistaken for correct, or complete, accounts of the phenomena 
in question. 
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*o ^ *2 

{Xi orth) = {how} 

(X 2 sem) = ( scalar) 

A complete account of rules like this involves semantics and will therefore 
have to wait for Chapter 14. The key point here is that by using feature 
structures a grammatical rule can place constraints on its constituents in a 
manner that does not make any use of the notion of a syntactic category. 

Of course, dealing this kind of rule requires some changes to our pars¬ 
ing scheme. All of the parsing approaches we have considered thus far are 
driven by the syntactic category of the various constituents in the input. More 
specifically, they are based on simple atomic matches between the categories 
that have been predicted, and categories that have been found. Consider, for 
example, the operation of the Completer function shown in Figure 11.12. 
This function searches the chart for states that can be advanced by a newly 
completed state. It accomplishes this by matching the category of the newly 
completed state against the category of the constituent following the • in the 
existing state. Clearly this approach will run into trouble when there are no 
such categories to consult. 

The remedy for this problem with Completer is to search the chart 
for states whose DAGs unify with the DAG of the newly completed state. 
This eliminates any requirement that states or rules have a category. The 
Predictor can be changed in a similar fashion by having it add states to 
the chart states whose X 0 DAG component can unify with the constituent 
following the • of the predicting state. Exercise 11.6 asks you to make the 
necessary changes to the pseudo-code in Figure 11.12 to effect this style of 
parsing. Exercise 11.7 asks you to consider some of the implications of these 
alterations, particularly with respect to prediction. 

11.6 Types and Inheritance 

I am surprised that ancient and modern writers have not attributed 

greater importance to the laws of inheritance ... 

(de Tocqueville, 1966) 

The basic feature structures we have presented so far have two prob¬ 
lems that have led to extensions to the formalism. The first problem is that 
there is no way to place a constraint on what can be the value of a feature. 
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For example, we have implicitly assumed that the number attribute can take 
only SG and PL as values. But in our current system, there is nothing, for ex¬ 
ample, to stop number from have the value 3rd or feminine as values: 

[number feminine 

This problem has caused many unification-based grammatical theories 
to add various mechanisms to try to constrain the possible values of a fea¬ 
ture. Formalisms like Functional Unification Grammar (FUG) (Kay, 1979, 
1984, 1985) and Lexical Functional Grammar (LFG) (Bresnan, 1982), for 
example, focused on ways to keep intransitive verb like sneeze from uni¬ 
fying with a direct object (Marie sneezed Pauline). This was addressed in 
none FUG by adding a special atom none which is not allowed to unify with any¬ 
thing, and in LFG by adding coherence conditions which specified when a 
feature should not be filled. Generalized Phrase Structure (GPSG) (Gazdar 
et al , 1985, 1988) added a class of feature co-occurrence restrictions, to 
prevent, for example, nouns from having some verbal properties. 

The second problem with simple feature structures is that there is no 
way to capture generalizations across them. For example, the many types of 
English verb phrases described in the Subcategorization section on page 407 
share many features, as do the many kinds of subcategorization frames for 
verbs. Syntactitions were looking for ways to express these generalities 
types A general solution to both of these problems is the use of types. Type 

systems for unification grammars have the following characteristics: 


APPROPRI¬ 

ATENESS 


TYPE 

HIERARCHY 


TYPED 

FEATURE 

STRUCTURE 


SIMPLE 

TYPES 

COMPLEX 

TYPES 


TYPE 

HIERARCHY 


1. Each feature structure is labeled by a type. 

2. Conversely, each type has appropriateness conditions expressing which 
features are appropriate for it. 

3. The types are organized into a type hierarchy, in which more specific 
types inherit properties of more abstract ones. 

4. The unification operation is modified to unify the types of feature struc¬ 
tures in addition to unifying the attributes and values. 

In such typed feature structure systems, types are a new class of 
objects, just like attributes and values were for standard feature structures. 
Types come in two kinds: simple types (also called atomic types), and com¬ 
plex types. Let’s begin with simple types. A simple type is an atomic sym¬ 
bol like sg or pi (we will use boldface for all types), and replaces the simple 
atomic values used in standard feature structures. All types are organized 
into a multiple-inheritance type hierarchy (a partial order or lattice). Fig- 



Section 11.6. Types and Inheritance 


435 


ure 11.13 shows the type hierarchy for the new type agr, which will be the 
type of the kind of atomic object that can be the value of an agree feature. 



3sg-masc 3sg-fem 3sg-neut 


Figure 11.13 A simple type hierarchy for the subtypes of type agr which 
can be the value of the agree attribute. After Carpenter (1992). 


In the hierarchy in Figure 11.13, 3rd is a subtype of agr, and 3-sg is 
a subtype of both 3rd and sg. Types can be unified in the type hierarchy; 
the unification of any two types is the most-general type that is more specific 
than the two input types. Thus: 

3rd U sg = 3sg 

1st U pi = lpl 

1st U agr = 1st 

3rd U 1st = undefined 

The unification of two types which do not have a defined unifier is 
undefined, although it is also possible to explicitly represent this fail type 
using the symbol _L (Ait-Kaci, 1984). 

The second kind of types are complex types, which specify: 

• A set of features that are appropriate for that type 

• Restrictions on the values of those features (expressed in terms of 

types) 

• Equality constraints between the values 

Consider a simplified representation of the complex type verb, which 
just represents agreement and verb morphological form information. A defi¬ 
nition of verb would define the two appropriate features, agree and vform, 
and would also define the type of the values of the two features. Let’s sup¬ 
pose that the agree feature takes values of type agr defined in Figure 11.13 
above, and the vform feature takes values of type vform (where vform sub¬ 
sumes the 7 subtypes finite, infinitive, gerund, base, present-participle, 
past-participle, and passive-participle. Thus verb would be defined as fol- 


SUBTYPE 


FAIL TYPE 
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lows (where the convention is to indicate the type either at the type of the 
AYM or just to the lower left of the left bracket): 

verb 

AGREE agr 

vform vform 

By contrast, the type noun might be defined with the agree feature, 
but without the vform feature: 

noun 

AGREE agr 

The unification operation is augmented for typed feature structures just 
by requiring that the type of the two structures must unify in addition to the 
values of the component features unifying. 


verb 

U 

verb 

= 

verb 

AGREE 1st 


AGREE Sg 


AGREE 1-Sg 

vform gerund 


vform gerund 


vform gerund 


Complex types are also part of the type hierarchy. Subtypes of complex 
types inherit all the features of their parents, together with the constraints 
on their values. Sanfilippo (1993), for example, uses the type hierarchy to 
encode the hierarchical structure of the lexicon. Figure 11.14 shows a small 
part of this hierarchy, the part that models the various subcategories of verbs 
which take sentential complements; these are divided into the transitive ones 
(which take direct objects: (ask yourself whether you have become better 
informed) and the intransitive ones (Monsieur asked whether I wanted to 
ride). The type trans-comp-cat would introduce the required direct object, 
constraining it to be of type noun-phrase, while types like sbase-comp- 
cat would introduce the baseform (bare stem) complement and constraint its 
vform to be the baseform. 

Extensions to Typing 

Typed feature structures can be extended by allowing inheritance with de- 
defaults faults. Default systems have been mainly used in lexical type hierarchies 
of the sort described in the previous section, in order to encode generaliza¬ 
tions and subregular exceptions to them. In early versions of default unifi- 
union ity cation the operation was order-dependent, based on the priority union op¬ 
eration (Kaplan, 1987). More recent architectures, such as Lascarides and 
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tr-sfin-comp-cat base-comp-cat intr-swh-comp-cat i n t r-s i n f-com p -ca t 

tr-swh-comp-cat intr-sfin-comp-cat intr-sbase-comp-cat 


Figure 11.14 Part of the type hierarchy for the verb type verb-cat, showing 
the subtypes of the comp-cat type. These are all subcategories of verbs which 
take sentential complements. After Sanfilippo (1993). 


Copestake (1997) default unification for typed feature structures, are order- 
independent, drawing on Young and Rounds (1993) and related to Reiter’s 
default logic (Reiter, 1980). 

Many unification-based theories of grammar, including HPSG (Pollard 
and Sag, 1987, 1994) andLFG (Bresnan, 1982) use an additional mechanism 
besides inheritance for capturing lexical generalizations, the lexical rule, lexical rule 
Lexical rules express lexical generalizations by allowing a reduced, hence 
more redundant-free lexicon to be automatically expanded by the rules. Pro¬ 
posed originally by Jackendoff (1975), see Pollard and Sag (1994) for exam¬ 
ples of modem lexical rules, Carpenter (1991) for a discussion of complexity 
issues, and Meurers and Minnen (1997) for a recent efficient implementa¬ 
tion. Some authors have proposed using the type hierarchy to replace lexical 
rules, either by adding abstract types and some disjunctions Krieger and Ner- 
bonne (1993) or via type underspecification and dynamic typing, in which underspeci 
underspecified types are combined to make new types on-line (Koenig and typing 10 
Jurafsky, 1995). 

Types can also be used to represent constituency. Rules like (11.13) 
on page 407 used a normal phrase structure rule template and added the fea¬ 
tures via path equations. Instead, it’s possible to represent the whole phrase 
structure rule as a type. In order to do this, we need a way to represent con¬ 
stituents as features. One way to do this, following Sag and Wasow (1999), is 
to take a type phrase which has a feature called dtrs (‘daughters’), whose 
value is a list of phrases. For example the phrase I love New York could have 
the following representation, (showing only the dtrs feature): 


phrase 





CAT VP 

DTRS ( 

CAT PRO 


r- 

ORTH I 


DTRS ( 


L J 



CAT V 
ORTH LOVE 


CAT NP > 

orth New York ' 
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Other Extensions to Unification 

There are many other extensions to unification besides typing, including 
inequations P a th inequations (Moshier, 1988; Carpenter, 1992; Carpenter and Penn, 
negation 1994) negation (Johnson, 1988, 1990), set-valued features (Pollard and 

features 0 Moshier, 1990), and disjunction Kay (1979), Kasper and Rounds (1986). 

disjunction In some unification systems these operations are incorporated into feature 
structures. Kasper and Rounds (1986) and others, by contrast, implement 
them in a separate metalanguage which is used to describe feature structures. 
This idea derives from the work of Pereira and Shieber (1984), and even ear¬ 
lier work by Kaplan and Bresnan (1982), all of whom distinguished between 
a metalanguage for describing feature structures and the actual feature struc¬ 
tures themselves. The descriptions may thus use negation and disjunction to 
describe a set of feature structures (i.e. a certain feature must not contain a 
certain value, or may contain any of a set of values), but an actual instance 
of a feature structure that meets the description would not have negated or 
disjoint values. 

11.7 Summary 

This chapter introduced feature structures and the unification operation which 
is used to combine them. 

• A feature structure is a set of features-value pairs, where features are 
unanalyzable atomic symbols drawn from some finite set, and val¬ 
ues are either atomic symbols or feature structures. They are repre¬ 
sented either as attribute-value matrices (AVMs) or as acyclic graphs 
(DAGs), where features are directed labeled edges and feature values 
are nodes in the graph. 

• Unification is the operation for both combining information (merging 
the information content of two feature structures) and comparing in¬ 
formation (rejecting the merger of incompatible features). 

• A phrase-structure rule can be augmented with feature structures, and 
with feature constraints expressing relations among the feature struc¬ 
tures of the constituents of the rule. Subcategorization constraints can 
be represented as feature structures on head verbs (or other predicates). 
The elements which are subcategorized for by a verb may appear in the 
verb phrase or may be realized apart from the verb, as a long-distance 
dependency. 
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• Feature structures can be typed. The resulting typed feature struc¬ 
tures place constraints on which type of values a given feature can 
take, and can also be organized into a type hierarchy to capture gen¬ 
eralizations across types. 


Bibliographical and Historical Notes 

The use of features in linguistic theory comes originally from phonology. 
Anderson (1985) credits Jakobson (1939) with being the first to use features 
(called distinctive features) as an ontological type in a theory, drawing on 
previous uses of features by Trubetskoi (1939) and others. The semantic use 
of features followed soon after; see Chapter 16 for the history of componen- 
tial analysis in semantics. Features in syntax were well established by the 
50s and were popularized by Chomsky (1965). 

The unification operation in linguistics was developed independently 
by Kay (1979) (feature structure unification) and Colmerauer (1970, 1975) 
(term unification). Both were working in machine translation and looking 
for a formalism for combining linguistic information which would be re¬ 
versible. Colmerauer’s original Q-system was a bottom-up parser based on 
a series of rewrite rules which contained logical variables, designed for a 
English to French machine translation system. The rewrite rules were re¬ 
versible to allow them to work for both parsing and generation. Colmerauer, 
Fernand Didier, Robert Pasero, Philippe Roussel, and Jean Trudel designed 
the Prolog language based on extended Q-systems to full unification based 
on the resolution principle of Robinson (1965), and implemented a French 
analyzer based on it (Colmerauer and Roussel, 1996). The modem use of 
Prolog and term unification for natural language via Definite Clause Gram¬ 
mars was based on Colmerauer’s (1975) metamorphosis grammars, and was 
developed and named by Pereira and Warren (1980). Meanwhile Martin Kay 
and Ron Kaplan had been working with ATN grammars. In an ATN analysis 
of a passive, the first NP would be assigned to the subject register, then when 
the passive verb was encountered, the value would be moved into the object 
register. In order to make this process reversible, they restricted assignments 
to registers so that certain registers could only be filled once, i.e. couldn’t be 
overwritten once written. They thus moved toward the concepts of logical 
variables without realizing it. Kay’s original unification algorithm was de¬ 
signed for feature structures rather than terms (Kay, 1979). The integration 



440 


Chapter 11. Features and Unification 


of unification into an Earley-style approach given in Section 11.5 is based 
on (Shieber, 1985b). 

See Shieber (1986) for a clear introduction to unification, and Knight 
(1989) for a multidisciplinary survey of unification. 

Inheritance and appropriateness conditions were first proposed for lin¬ 
guistic knowledge by Bobrow and Webber (1980) in the context of an ex¬ 
tension of the KL-ONE knowledge representation system (Brachman and 
Schmolze, 1985b). Simple inheritance without appropriateness conditions 
was taken up by number of researchers; early users include Jacobs (1985) 
& (1987) and Flickinger et al (1985). Ai‘t-Kaci (1984) borrowed the no¬ 
tion of inheritance in unification from the logic programming community. 
Typing of feature structures, including both inheritance and appropriateness 
conditions, was independently proposed by Calder (1987), Pollard and Sag 
(1987), and Elhadad (1990). Typed feature structures were formalized by 
King (1989) and Carpenter (1992). There is an extensive literature in the 
use of type hierarchies in linguistics, particularly for capturing lexical gen¬ 
eralizations; besides the papers previously discussed, the interested reader 
should consult Evans and Gazdar (1996) for a description of the DATR lan¬ 
guage, designed for defining inheritance networks for linguistic knowledge 
representation, Fraser and Hudson (1992) for the use of inheritance in a de¬ 
pendency grammar and Daelemans et al (1992) for a general overview. For¬ 
malisms and systems for the implementation of constraint-based grammars 
via typed feature structures include PAGE (?), ALE (Carpenter and Penn, 
1994), and ConTroll (Gotz et al , 1997). 

Grammatical theories based on unification include Lexical Functional 
Grammar (LFG) (Bresnan, 1982), Head-Driven Phrase Structure Grammar 
(HPSG) (Pollard and Sag, 1987, 1994), Construction Grammar (Kay and 
Fillmore, 1999), and Unification Categorial Grammar (Uszkoreit, 1986). 


Exercises 

11.1 Draw the DAGs corresponding to the AVMs given in Examples 11.1 
and 11.2. 

11.2 Consider the following BERP examples, focusing on their use of pro¬ 


nouns. 
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I want to spend lots of money. 

Tell me about Chez-Panisse. 

Fd like to take her to dinner. 

She doesn’t like mexican. 

Assuming that these pronouns all belong to the category Pro , write lexical 
and grammatical entries with unification constraints that block the following 
examples. 

*Me want to spend lots of money. 

*Tell I about Chez-Panisse. 

*1 would like to take she to dinner. 

*Her doesn’t like mexican. 

11.3 Draw a picture of the subsumption semilattice corresponding to the 
feature structures in Examples 11.3 to 11.8. Be sure to include the most 
general feature structure []. 

11.4 Consider the following examples. 

The sheep are baaaaing. 

The sheep is baaaaing. 

Create appropriate lexical entries for the words the , sheep , and baaaaing. 
Show that your entries permit the correct assignment of a value to the num¬ 
ber feature for the subjects of these examples, as well as their various 
parts. 

11.5 Create feature structures expressing the different subcat frames for 
while and during shown on page 412. 

11.6 Alter the pseudocode shown in Figure 11.12 so that it performs the 
more radical kind of unification parsing described on page 431. 

11.7 Consider the following problematic grammar suggested by Shieber 
(1985b). 

S T 

(TF)=a 

T\ -$• T 2 A 
{Ti f) = { T 2 f f) 

S A 
A —y a 
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Show the first S state entered into the chart using your modified pre¬ 
dictor from the previous exercise, then describe any problematic behavior 
displayed by predictor on subsequent iterations. Discuss the cause of the 
problem and how in might be remedied. 

11.8 Using the list approach to representing a verb’s subcategorization 
frame, show how a grammar could handle any number of verb subcatego¬ 
rization frames with only the following two VP rules. More specifically, 
show the constraints that would have to be added to these rules to make this 
work. 


VP Verb 

VP VPX 

The solution to this problem involves thinking about a recursive walk down 
a verb’s subcategorization frame. This is a hard problem; you might consult 
Shieber (1986) if you get stuck. 

11.9 Page 437 showed how to use typed feature structure to represent con¬ 
stituency. Use that notation to represent rules 11.13, 11.14, and 11.15 shown 
on page 407. 




LEXICALIZED AND 
PROBABILISTIC PARSING 


Two roads diverged in a yellow wood, 

And sorry I could not travel both 
And be one traveler, long I stood 
And looked down one as far as I could 
To where it bent in the undergrowth ... 

Robert Frost The Road Not Taken 


The characters in Damon Runyon’s short stories are willing to bet “on any 
proposition whatever”, as Runyon says about Sky Masterson in The Idyll of 
Miss Sarah Brown ; from the probability of getting aces back-to-back to the 
odds against a man being able to throw a peanut from second base to home 
plate. There is a moral here for language processing: with enough knowl¬ 
edge we can figure the probability of just about anything. The last three 
chapters have introduced sophisticated models of syntactic structure and its 
parsing. In this chapter we show that it is possible to build probabilistic mod¬ 
els of sophisticated syntactic information and use some of this probabilistic 
information in efficient probabilistic parsers. 

Of what use are probabilistic grammars and parsers? One key contri¬ 
bution of probabilistic parsing is to disambiguation. Recall that sentences 
can be very ambiguous; the Earley algorithm of Chapter 10 could repre¬ 
sent these ambiguities in an efficient way, but was not equipped to resolve 
them. A probabilistic grammar offers a solution to the problem: choose 
the most-probable interpretation. Thus, due to the prevalence of ambiguity, 
probabilistic parsers can play an important role in most parsing or natural- 
language understanding task. 
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12.1 


PCFG 

SCFG 


Another important use of probabilistic grammars is in language mod¬ 
eling for speech recognition or augmentative communication. We saw that 
A-gram grammars were important in helping speech recognizers in predict¬ 
ing upcoming words, helping constrain the search for words. Probabilistic 
versions of more sophisticated grammars can provide additional predictive 
power to a speech recognizer. Indeed, since humans have to deal with the 
same problems of ambiguity as do speech recognizers, it is significant that 
we are finding psychological evidence that people use something like these 
probabilistic grammars in human language-processing tasks (reading, hu¬ 
man speech understanding). 

This integration of sophisticated structural and probabilistic models of 
syntax is at the very cutting edge of the field. Because of its newness, no 
single model has become standard, in the way the context-free grammar has 
become a standard for non-probabilistic syntax. We will explore the field 
by presenting a number of probabilistic augmentations to context-free gram¬ 
mars, showing how to parse some of them, and suggesting directions the 
field may take. The chapter begins with probabilistic context-free gram¬ 
mars (PCFGs), a probabilistic augmentation of context-free grammars, to¬ 
gether with the CYK algorithm, a standard dynamic programming algo¬ 
rithm for parsing PCFGs. We then show two simple extensions to PCFGs 
to handle probabilistic subcategorization information and probabilistic lex¬ 
ical dependencies, give an evaluation metric for evaluating parsers, and then 
introduce some advanced issues and some discussion of human parsing. 

Probabilistic Context-Free Grammars 


The simplest augmentation of the context-free grammar is the Probabilistic 
Context-Free Grammar (PCFG), also known as the Stochastic Context- 
Free Grammar (SCFG), first proposed by Booth (1969). 

Recall that a context-free grammar G is defined by four parameters 
(A, Z,P,S): 

1. a set of nonterminal symbols (or ‘variables’) A 

2. a set of terminal symbols E (disjoint from A) 

3. a set of productions P, each of the form A —> (3, where A is a non¬ 
terminal and (3 is a string of symbols from the infinite set of strings 
(EUA)*. 

4. a designated start symbol S 
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S -4 NP VP 

[.80] 

Del -4 that [.05] the [.80] a 

:[.15] 

S -4 AuxNPVP 

[.15] 

Noun -A book 

[.10] 

S -4 VP 

[.05] 

Noun —> flights 

[.50] 

NP -4 DetNom 

[.20] 

Noun -A meal 

[.40] 

NP Proper-Noun 

[.35] 

Verb -A book 

[.30] 

NP -4 Nom 

[.05] 

Verb -A include 

[.30] 

NP -A Pronoun 

[.40] 

Verb -A want 

[.40] 

Nom -A Noun 

[.75] 

Aux -A can 

[.40] 

Nom -A Noun Nom 

[.20] 

Aux -A does 

[.30] 

Nom -A Proper-Noun Nom 

[.05] 

Aux -A do 

[.30] 

VP -a Verb 

[.55] 

Proper-Noun -A TWA 

[.40] 

VP -a Verb NP 

[.40] 

Proper-Noun -A Denver 

[.40] 

VP -a Verb NP NP 

[.05] 

Pronoun -A yow[.40] /[.60] 



Figure 12.1 A PCFG; a probabilistic augmentation of the miniature En¬ 
glish grammar and lexicon in Figure 10.2. These probabilities are not based 
on a corpus; they were made up merely for expository purposes. 


A probabilistic context-free grammar augments each rule in P with a 
conditional probability: 

\p\ (12.1) 

A PCFG is thus a 5-tuple G = (TV, Z, P, 5, D), where D is a function 
assigning probabilities to each rule in P. This function expresses the proba¬ 
bility p that the given nonterminal A will be expanded to the sequence P; it 
is often referred to as 

P(A^P) 

or as 

P(A^p|A) 

Formally this is conditional probability of a given expansion given the 
left-hand-size nonterminal A. Thus if we consider all the possible expansions 
of a nonterminal, the sum of their probabilities must be 1. Figure 12.1 shows 
a sample PCFG for a miniature grammar with only three nouns and three 
verbs. Note that the probabilities of all of the expansions of a nonterminal 
sum to 1. Obviously in any real grammar there are a great many more rules 
for each nonterminal and hence the probabilities of any particular rule are 
much smaller. 
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How are these probabilities used? A PCFG can be used to estimate a 
number of useful probabilities concerning a sentence and its parse-tree(s). 
For example a PCFG assigns a probability to each parse-tree T (i.e. each 
derivation) of a sentence S. This attribute is useful in disambiguation. 
For example, consider the two parses of the sentence “Can you book TWA 
flights” (one meaning ‘Can you book flights on behalf of TWA’, and the other 
meaning ‘Can you book flights run by TWA) shown in Figure 12.2. 

The probability of a particular parse T is defined as the product of the 
probabilities of all the rules r used to expand each node n in the parse tree: 

p(r,s) = npW«)) (12.2) 

neT 

The resulting probability P(T,S) is both the joint probability of the 
parse and the sentence, and also the probability of the parse P(T). How can 
this be true? First, by the definition of joint probability: 

P{T,S) = P{T)P{S\T) (12.3) 

But since a parse tree includes all the words of the sentence, P(S|T) is 
1. Thus: 

P(T : S) = P{T)P(S\T) = P(T) (12.4) 

The probability of each of the trees in Figure 12.2 can be computed by 
multiplying together each of the rules used in the derivation. For example, 
the probability of the left tree in Figure 12.2a (call it 7]) and the right tree 
(12.2b or T r ) can be computed as follows: 

p(Ti) = .15 * .40* .05 * .05 * .35 * .75 * .40* .40* .40 
*.30* .40* .50 

= 1.5 x 10 -6 (12.5) 

P{T r ) = . 15 * .40 * .40 * .05 * .05 * .75 * .40 * .40 * .40 
*.30 *.40 *.50 

= 1.7 x 10 -6 (12.6) 

We can see that the right tree in Figure 12.2(b) has a higher probability. 
Thus this parse would correctly be chosen by a disambiguation algorithm 
which selects the parse with the highest PCFG probability. 

Let’s formalize this intuition that picking the parse with the highest 
probability is the correct way to do disambiguation. The disambiguation 
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can you book TWA flights can you book TWA flights 
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Figure 12.2 Two parse trees for an ambiguous sentence. Parse (a) corre¬ 
sponds to the meaning ‘Can you book flights on behalf of TWA?’, parse (b) to 
‘Can you book flights which are run by TWA. 


algorithm picks the best tree for a sentence S out of the set of parse trees 
for S (which we’ll call t(S)). We want the parse tree T which is most likely 
given the sentence S. 

f(S) = argmaxP(7|S) (12.7) 

Tez(s) 

By definition the probability P(T|S) can be rewritten as P(T,S)/P(S), thus 
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PREFIX 


CONSISTENT 


leading to: 

T(S) = argmax (12.8) 

Tex{s) 

Since we are maximizing over all parse trees for the same sentence, 
P(S ) will be a constant for each tree, and so we can eliminate it: 

f(S) = argmaxP(T,S) (12.9) 

Te%{s) 

Furthermore, since we showed above that P(T,S) = P(T), the final 
equation for choosing the most likely parse simplifies to choosing the parse 
with the highest probability: 

f (S) = argmaxP(T) (12.10) 

Te%{s) 

A second attribute of a PCFG is that it assigns a probability to the string 
of words constituting a sentence. This is important in language modeling in 
speech recognition, spell-correction, or augmentative communication. The 
probability of an unambiguous sentence is P(T,S) = P(T) or just the prob¬ 
ability of the single parse tree for that sentence. The probability of an am¬ 
biguous sentence is the sum of the probabilities of all the parse trees for the 
sentence: 

ns ) = E p ( T > s ) 02.11> 

T£X(S) 

= E p ( T ) 02.12) 

Te<z(s) 

An additional useful feature of PCFGs for language modeling is that 
they can assign a probability to substrings of a sentence. For example, Je- 
linek and Lafferty (1991) give an algorithm for efficiently computing the 
probability of a prefix of a sentence. This is the probability that the grammar 
generates a sentence whose initial substring is Stolcke (1995) 

shows how the standard Earley parser can be augmented to compute these 
prefix probabilities, and Jurafsky et al (1995) describes an application of a 
version of this algorithm as the language model for a speech recognizer. 

A PCFG is said to be consistent if the sum of the probabilities of all 
sentences in the language equals 1. Certain kinds of recursive rules cause 
a grammar to be inconsistent by causing infinitely looping derivations for 
some sentences. For example a rule S' —> S with probability 1 would lead to 
lost probability mass due to derivations that never terminate. See Booth and 
Thompson (1973) for more details on consistent and inconsistent grammars. 
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Probabilistic CYK Parsing of PCFGs 

The parsing problem for PCFGs is to produce the most-likely parse for a 
given sentence, i.e. to compute 

t (S) = argmaxP(T) (12.13) 

r<ET(s) 

Luckily, the algorithms for computing the most-likely parse are sim¬ 
ple extensions of the standard algorithms for parsing. Chapter 10 introduced 
the use of the Earley algorithm to find all parses for a given input sentence 
and a given context-free grammar. It is possible to augment the Earley al¬ 
gorithm to compute the probability of each of its parses, and thus to find the 
most likely parse. Instead of presenting the probabilistic Earley algorithm 
here, however, we will present the probabilistic CYK (Cocke-Younger- 
Kasami) algorithm. We do this because the probabilistic Earley algorithm is 
somewhat complex to present, and also because the CYK algorithm is worth 
understanding, and we haven’t yet studied it. The reader is thus referred to 
Stolcke (1995) for the presentation of the probabilistic Earley algorithm. 

Where the Earley algorithm is essentially a top-down parser which uses 
a dynamic programming table to efficiently store its intermediate results, the 
CYK algorithm is essentially a bottom-up parser using the same dynamic 
programming table. The fact that CYK is bottom-up makes it more efficient 
when processing lexicalized grammars, as we will see later. 

Probabilistic CYK parsing was first described by Ney (1991), but the 
version of the probabilistic CYK algorithm that we present is adapted from 
Collins (1999) and Aho and Ullman (1972). Assume first that the PCFG is 
in Chomsky normal form; recall from page 344 that a grammar is in CNF if 
it is 8-free and if in addition each production is either of the form A^B C 
or A —> a. The CYK algorithm assumes the following input, output, and data 
structures: 

• Input. 

- A Chomsky normal form PCFG G = {A, L, P, 5, £>}. Assume that 
the |A| nonterminals have indices 1, 2,... |A|, and that the start 
symbol S has index 1. 

- n words W \... w n . 

• Data Structure. A dynamic programming array %[ij,a\ holds the 
maximum probability for a constituent with nonterminal index a span¬ 
ning words i... j . Back-pointers in the area are used to store the links 
between constituents in a parse-tree. 
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• Output. The maximum probability parse will be %[1 1]: the parse 

tree whose root is S and which spans the entire string of words w\...w n . 

Like the other dynamic programming algorithms (minimum edit dis¬ 
tance, Forward, Yiterbi, and Earley), the CYK algorithm fills out the prob¬ 
ability array by induction. In this description, we will use Wij, to mean the 
string of words from word i to word j, following Aho and Ullman (1972): 

• base case: Consider the input strings of length one (i.e. individual 
words Wi). In Chomsky normal form, the probability of a given nonter¬ 
minal A expanding to a single word w; must come only from the rule 
A Wi (since A => Wi if and only if A —> Wi is a production). 

• recursive case: For strings of words of length > 1, A wtj if and only 
if there is at least one rule A —^ BC and some fc, 1 < k < j, such that B 
derives the first k symbols of Wij and C derives the last j — k symbols 
of Since each of these strings of words is shorter than the original 
string w^ 9 their probability will already be stored in the matrix %. We 
compute the probability of by multiplying together the probability 
of these two pieces. But there may be multiple parses of w^, and so 
we’ll need to take the max over all the possible divisions of w^ (i.e. 
over all values of k and over all possible rules). 

Figure 12.3 gives pseudocode for this probabilistic CYK algorithm, 
again adapted from Collins (1999) and Aho and Ullman (1972). 


Learning PCFG probabilities 


TREEBANK 


Where do PCFG probabilities come from? There are two ways to assign 
probabilities to a grammar. The simplest way is to use a corpus of already- 
parsed sentences. Such a corpus is called a treebank. For example the Penn 
Treebank (Marcus et al , 1993), distributed by the Linguistic Data Consor¬ 
tium, contains parse trees for the Brown Corpus, one million words from 
the Wall Street Journal, and parts of the Switchboard corpus. Given a tree- 
bank, the probability of each expansion of a nonterminal can be computed by 
counting the number of times that expansion occurs and then normalizing. 
Count(a —> p) Count(a —> p) 


P(a-t P|a) = 


LyCount(a —> y) Count(a) 


(12.14) 


When a treebank is unavailable, the counts needed for computing PCFG 
probabilities can be generated by first parsing a corpus. If sentences were 
unambiguous, it would be as simple as this: parse the corpus, increment a 
counter for every rule in the parse, and then normalize to get probabilities. 
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function CYK (words, grammar) returns best sparse 
Create and clear p [num.wordsjmm.wordsjiumjionterminals\ 

# base case 

for i= 1 to num.words 

for A = 1 to num-nonterminals 
if A —y Wi is in grammar then 
7l [i, z, A] = P(A —> Wi) 

# recursive case 

for / = 2 to nuM-Words 
for i= 1 to num.words-j+1 
for k = 1 to j- 1 

for A = 1 to num.nonterminals 
for B = 1 to num .nonterminals 
for C = 1 to numjionterminals 
prob = n[i,k,B] x C] x P(A—>BC) 

if (prob > %[i,j,A]) then 
= prob 

B[iJ,A] = {k,A,B} 


Figure 12.3 The Probabilistic CYK algorithm for finding the maximum 
probability parse of a string of num.words words given a PCFG grammar with 
numjrules rules in Chomsky Normal Form. B is the array of back-pointers 
used to recover the best parse. After Collins (1999) and Aho and Ullman 
(1972). 


However, since most sentences are ambiguous, in practice we need to keep 
a separate count for each parse of a sentence and weight each partial count 
by the probability of the parse it appears in. The standard algorithm for 
computing this is called the Inside-Outside algorithm, and was proposed ouIsFde 
by Baker (1979) as a generalization of the forward-backward algorithm of 
Chapter 7. See Manning and Schiitze (1999) for a complete description of 
the algorithm. 


12.2 Problems with PCFGs 


While probabilistic context-free grammars are a natural extension to context- 
free grammars, they have a number of problems as probability estimators. 
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Because of these problems, most current probabilistic parsing models use 
some augmentation of PCFGs rather than using vanilla PCFGs. This section 
will summarize problems with PCFGs in modeling structural dependencies 
and in modeling lexical dependencies. 

One problem with PCFGs comes from their fundamental independence 
assumption. By definition, a CFG assumes that the expansion of any one 
nonterminal is independent of the expansion of any other nonterminal. This 
independence assumption is carried over in the probabilistic version; each 
PCFG rule is assumed to be independent of each other rule, and thus the rule 
probabilities are multiplied together. But an examination of the statistics of 
English syntax shows that sometimes the choice of how a node expands is de¬ 
pendent on the location of the node in the parse tree. For example, consider 
the differential placement in a sentence of pronouns versus full lexical noun 
phrases. Beginning with Kuno (1972), many linguists have shown that there 
is a strong tendency in English (as well as in many other languages) for the 
syntactic subject of a sentence to be a pronoun. This tendency is caused by 
the use of subject position to realize the ‘topic’ or old information in a sen¬ 
tence (Givon, 1990). Pronouns are a way to talk about old information, while 
non-pronominal (‘lexical’) noun-phrases are often used to introduce new ref¬ 
erents. For example, Francis etal (1999) show that of the 31,021 subjects of 
declarative sentences in Switchboard, 91% are pronouns (12.15a), and only 
9% are lexical (12.15b). By contrast, out of the 7,489 direct objects, only 
34% are pronouns (12.16a), and 66% are lexical (12.16b). 

(12.15) (a) She’s able to take her baby to work with her. 

(b) Uh, my wife worked until we had a family. 

(12.16) (a) Some laws absolutely prohibit it. 

(b) All the people signed confessions. 

These dependencies could be captured if the probability of expanding 
an NP as a pronoun (for example via the rule NP —> Pronoun ) versus a lexical 
NP (for example via the rule NP —> DetNoun) were dependent on whether 
the NP was a subject or an object. But this is just the kind of probabilistic 
dependency that a PCFG does not allow. 

An even more important problem with PCFGs is their lack of sensitiv¬ 
ity to words. Lexical information in a PCFG can only be represented via the 
probability of pre-terminal nodes (Verb, Noun , Dei) to be expanded lexically. 
But there are a number of other kinds of lexical and other dependencies that 
turn out to be important in modeling syntactic probabilities. For example 
a number of researchers have shown that lexical information plays an im- 
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portant role in selecting the correct parsing of an ambiguous prepositional- 
phrase attachment (Ford et al. , 1982; Whittemore et al. , 1990; Hindle and 
Rooth, 1991, inter alia). Consider the following example from Hindle and 
Rooth (1991): 


(12.17) Moscow sent more than 100,000 soldiers into Afghanistan... 


Here the preposition phrase into Afghanistan can be attached either to 
the NP more than 100,000 soldiers or to the verb-phrase headed by sent. 
In a PCFG, the attachment choice comes down to the choice between two 
rules: NP —> NPPP (NP-attachment) and VP —► NPPP (VP-attachment). 
The probability of these two rules depends on the training corpus; Hindle and 
Rooth (1991) report that NP-attachment happens about 67% compared to 
33% for VP-attachment in 13 million words from the AP newswire; Collins 
(1999) reports 52% NP-attachment in a corpus containing a mixture of Wall 
Street Journal and I.B.M. computer manuals. Whether the preference is 52% 
or 67%, crucially in a PCFG this preference is purely structural and must be 
the same for all verbs. 

In (12.17), however, the correct attachment is to the verb; in this case 
because the verb send subcategorizes for a destination, which can be ex¬ 
pressed with the preposition into. Indeed all of the cases of ambiguous into- 
PP-attachments with the main verb send in the Penn Treebank’s Brown and 
Wall Street Journal corpora attached to the verb. Thus a model which kept 
separate lexical dependency statistics for different verbs would be able to 
choose the correct parse in these cases. 

Coordination ambiguities are another case where lexical dependencies 
are the key to choosing the proper parse. Figure 12.4 shows an example 
from Collins (1999), with two parses for the phrase dogs in houses and cats. 
Because dogs is semantically a better conjunct for cats than houses (and be¬ 
cause dogs can’t fit inside cats) the parse [dogs in l^p houses and cats]] 
is intuitively unnatural and should be dispreferred. The two parses in Fig¬ 
ure 12.4, however, have exactly the same PCFG rules and thus a PCFG will 
assign them the same probability. 

In summary, probabilistic context-free grammars have a number of in¬ 
adequacies as a probabilistic model of syntax. In the next section we sketch 
current methods for augmenting PCFGs to deal with these issues. 


LEXICAL 

DEPENDENCY 
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NP (b) NP 

Conj NP NP 


and Noun Noun Prep 



NP 

cats 

dogs in NP 

Conj 

NP 

Noun 


Noun 

and 

Noun 

houses 


houses 


cats 


Figure 12.4 An instance of coordination ambiguity. Although the left 
structure is intuitively the correct one, a PCFG will assign them identically 
probabilities since both structure use the exact same rules. After Collins 
(1999) 


12.3 Probabilistic Lexicalized CFGs 

We saw in Chapter 11 that syntactic constituents could be associated with a 
lexical head. This idea of a head for each constituent dates back to Bloom¬ 
field (1914), but was first used to extend PCFG modeling by Black et al 
(1992). The probabilistic representation of lexical heads used in recent parsers 
such as Chamiak (1997) and Collins (1999) is simpler than the complex 
head-feature models we saw in Chapter 11. In the simpler probabilistic rep¬ 
resentation, each nonterminal in a parse-tree is annotated with a single word 
which is its lexical head. Figure 12.5 shows an example of such a tree from 
Collins (1999), in which each nonterminal is annotated with its head. “Work¬ 
ers dumped sacks into a bin” is a shortened form of a WSJ sentence. 

In order to generate such a tree, each PCFG rule must be augmented to 
identify one right-hand-side constituent to be the head daughter. The head¬ 
word for a node is then set to the headword of its head daughter. Choosing 
these head daughters is simple for textbook examples (NN is the head of 
NP), but is complicated and indeed controversial for most phrases (should 
the complementizer to or the verb be the head of an infinite verb-phrase?). 
Modem linguistic theories of syntax generally include a component that de¬ 
fines heads (see for example Pollard and Sag, 1994). Collins (1999) also 
gives a description of a practical set of head mles for Penn Treebank gram¬ 
mars modified from Magerman; for example their rule for finding the head 
of an NP is to return the very last word in the NP if it is tagged POS (posses- 
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S (dumped) 

NP(workers) VP(dumped) 

|NNS(workers) VBD(dumped) NP(sacks) PP(into) 

NNS (sacks) P(into) NP(bin) 

DT(a) NN(bin) 

workers dumped sacks into I bin 


Figure 12.5 A lexicalized tree from Collins (1999). 


sive); else to search from right to left in the NP for the first child which is an 
NN, NNP, NNPS, NNS, NX, POS, or JJR; else to search from left to right 
for the first child which is an NP. 

One way to think of these head features is as a simplified version of 
the head features in a unification grammar; instead of complicated re-entrant 
feature values, we just allow an attribute to have a single value from a finite 
set (in fact the set of words in the vocabulary). Technically, grammars in 
which each node is annotated by non-recursive features are called attribute 
grammars. 

Another way to think of a lexicalized grammar is as a simple context- 
free grammar with a lot more rules; it’s as if we created many copies of each 
rule, one copy for each possible headword for each constituent; this idea of 
building a lexicalized grammar is due to Schabes et al (1988) and Schabes 
(1990). In general there may be too many such rules to actually keep them 
around, but thinking about lexicalized grammars this way makes it clearer 
that we can parse them with standard CFG parsing algorithms. 

Let’s now see how these lexicalized grammars can be augmented with 
probabilities, and how by doing so we can represent the kind of lexical de¬ 
pendencies we discussed above and in Chapter 9. Suppose we were to treat a 
probabilistic lexicalized CFG like a normal but huge PCFG. Then we would 
store a probability for each rule/head combination, as in the following con¬ 
trived examples: 


ATTRIBUTE 

GRAMMARS 


VP(dumped) -A VBD(dumped) NP(sacks) PP(into) [3xl(T 10 ] 
VP(dumped) —> VBD(dumped)NP(cats) PP(into) [sxicr 11 ] 
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VP(dumped) —> VBD(dumped)NP(hats) PP(into) [4xl(r 10 ] 

VP(dumped) VBD(dumped)NP(sacks) PP(above) [lxio -12 ] 

(12.18) 

The problem with this method, of course, is that there is no corpus 
big enough to train such probabilities. Training standard PCFG probabilities 
would result in zero counts for almost all the rules. To avoid this, we need to 
make some simplifying independence assumptions in order to cluster some 
of the counts. 

Perhaps the main difference between various modem statistical parsers 
lies in exactly which independence assumptions they make. In the rest of this 
section we describe a simplified version of Chamiak’s (1997) parser, but we 
could also have chosen any of the other similar dependency-based statistical 
parsers (such as Magerman (1995), Collins (1999), and Ratnaparkhi (1997)). 

Like many of these others, Chamiak’s parser incorporates lexical de¬ 
pendency information by relating the heads of phrases to the heads of their 
constituents. His parser also incorporates syntactic subcategorization infor¬ 
mation by conditioning the probability of a given rule expansion of a non¬ 
terminal on the head of the nonterminal. Let’s look at examples of slightly 
simplified versions of the two kinds of statistics (simplified by being condi¬ 
tioned on less factors than in Chamiak’s complete algorithm). 

First, recall that in a vanilla PCFG, the probability of a node n being 
expanded via rule r is conditioned on exactly one factor: the syntactic cat¬ 
egory of the node n. (For simplicity we will use the notation n to mean the 
syntactic category of n.) We will simply add one more conditioning factor: 
the headword of the node h(n). Thus we will be computing the probability 

p(r(n)\n, h(n )) (12.19) 

Consider for example the probability of expanding the VP in Figure 12.5 
via the rule r, which is: 

VP -> VBDNPPP 

This probability is p(r\VP, dumped ), answering the question “What is 
the probability that a VP headed by dumped will be expanded as VBD NP 
PPT\ This lets us capture subcategorization information about dumped ; for 
example, a VP whose head is dumped may be more likely to have an NP and 
a PP than a VP whose head is slept. 

Now that we have added heads as a conditioning factor, we need to 
decide how to compute the probability of a head. The null assumption would 
make all heads equally likely; the probability that the head of a node would 
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be sacks would be the same as the probability that the head would be racks. 
This doesn’t seem very useful. The syntactic category of the node ought 
to matter (nouns might have different kinds of heads than verbs). And the 
neighboring heads might matter too. Let’s condition the probability of a 
node n having a head h on two factors: the syntactic category of the node n , 
and the head of the node’s mother h{m{n)). This is the probability 

p(h(n) = wordi\n,h(m(n))) (12.20) 

Consider for example the probability that the NP that is the second 
daughter of the VP in Figure 12.5 has the head sacks. The probability of this 
head is p(head(n) = sacks\n = NP,h(m(n)) = dumped). This probability 
answers the question “What is the probability that an NP whose mother’s 
head is dumped has the head sacksT\ sketched in the following drawing: 

X(dumped) 

NP(?sacks?) 

The figure shows that what this head-probability is really doing is cap¬ 
turing dependency information e.g. between the words dumped and sacks. 

How are these two probabilities used to compute the probability of 
a complete parse? Instead of just computing the probability of a parse by 
multiplying each of the PCFG rule probabilities, we will modify equation 
(12.2) by additionally conditioning each rule on its head: 

P(T,S) = Y[p( r ( n )\ n iH n )) x p{h(n)\n,h(m(n))) (12.21) 

n£T 

Let’s look at a sample parse-ambiguity to see if these lexicalized prob¬ 
abilities will be useful in disambiguation. Figure 12.6 shows an alternative 
(incorrect) parse for the sentence “Workers dumped sacks into a bin”, again 
from Collins (1999). In this incorrect parse the PP into a bin modifies the 
NP sacks instead of the VP headed by dumped. This parse is incorrect be¬ 
cause into a bin is extremely unlikely to be a modifier of this NP; it is much 
more likely to modify dumped , as in the original parse in Figure 12.5. 

The head-head and head-rule probabilities in equation (12.21) will in¬ 
deed help us correctly choose the VP attachment (Figure 12.5) over the 
NP attachment (Figure 12.6). One difference between the two trees is that 
VP(dumped) expands to VBD NP PP in the correct tree and VBD NP in the 
incorrect tree. Let’s compute both of these by counting in the Brown corpus 
portion of the Penn Treebank. The first rule is quite likely: 


p(VP^VBDNPPP\VP,dumped) 
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S (dumped) 

NP(workers) YP(dumped) 

NNS(workers) VBD(dumped) NP(sacks) 

NP(sacks) PP(into) 

NNS(sacks) P(into) NP(bin) 

DT(a) NN(bin) 

workers dumped sacks into I bin 


Figure 12.6 An incorrect parse of the sentence in Figure 12.5 from Collins 
(1999) 


_ C(VP(dumped) VBDNPPP) 

Ip C(VP(dumped)->$) 

= 1 = -67 (12.22) 

The second rule never happens in the Brown corpus. In practice this 
zero value would be smoothed somehow, but for now let’s just notice that the 
first rule is preferred. This isn’t surprising, since dump is a verb of caused- 
motion into a new location: 


p(VP -A VBDNP\ VP, dumped) 


C(VP(dumped) VBDNP) 

Ip C(VP(dumped) -A (3) 

jj=0 (12.23) 


What about the head probabilities? In the correct parse, a PP node 
whose mother’s head is dumped has the head into. In the incorrect, a PP 
node whose mother’s head is sacks has the head into. Once again, let’s use 
counts from the Brown portion of the Treebank: 


p(into\PP, dumped ) = 


C(X(dumped) —>... PP(into )...) 
Y,f>C(X(dumped) —> ...PP...) 


(12.24) 
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p(into\PP , sacks) 


C(X(sacks) -^ ... PP( into )...) 
C(X(sacks) ...PP...) 


= q=? (12.25) 

Once again, the head probabilities correctly predict that dumped is 
more likely to be modified by into than is sacks. 

Of course, one example does not prove that one method is better than 
another. Furthermore, as we mentioned above, the probabilistic lexical gram¬ 
mar presented above is a simplified version of Chamiak’s actual algorithm. 
He adds additional conditioning factors (such as conditioning the rule-expan¬ 
sion probability on the syncat of the node’s grandparent), and also proposes 
various backoff and smoothing algorithms, since any given corpus may still 
be too small to acquire these statistics. Other statistical parsers include even 
more factors, such as the distinction between arguments and adjuncts and 
giving more weight to lexical dependencies which are closer in the tree than 
those which are further (Collins, 1999), the three left-most parts of speech 
in a given constituent (Magerman and Marcus, 1991), and general structural 
preferences (such as the preference for right-branching structures in English) 
(Briscoe and Carroll, 1993). 

Many of these statistical parsers have been evaluated (on the same cor¬ 
pus) using the methodology of the Methodology Box on page 460. 

Extending the CYK algorithm to handle lexicalized probabilities is left 
as an exercise for the reader. 


12.4 Dependency Grammars 

The previous section showed that constituent-based grammars could be aug¬ 
mented with probabilistic relations between head words, and showed that 
this lexical dependency information is important in modeling the lexical 
constraints that heads (such as verbs) place on their arguments or modifiers. 

An important class of grammar formalisms is based purely on this lex¬ 
ical dependency information itself. In these dependency grammars, con- gramm d a e r n s cy 
stituents and phrase-structure rules do not play any fundamental role. In¬ 
stead, the syntactic structure of a sentence is described purely in terms of 
words and binary semantic or syntactic relations between these words (called 
lexical dependencies), Dependency grammars often draw heavily from the pendencies 
work of Tesniere (1959), and the name dependency was presumably first dependency 
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Methodology Box: Evaluating Parsers 


The standard techniques for evaluating parsers and grammars are 
called the PARS EVA L measures, and were proposed by Black et al. 
(1991) based on the same ideas from signal-detection theory that we 
saw in earlier chapters. In the simplest case, a particular parsing of 
the test set (for example the Penn Trccbank) is defined as the correct 
parse. Given this ‘gold standard' for a test set, a given constituent in 
a candidate parse c of a sentence ,y is labeled ‘correctly* if there is a 
constituent in the tieebank parse with the same starting point, ending 
point, and nonterminal symbol. We can then measure the precision, 
recall, and a new metric (crossing brackets) for each sentence x: 


lnhr l rrt rwnll- — # of correct constituents in candidate parse of .v 
rec * — # of correct constituents in trccbank parse ofT 


labeled precision: 


# of correct constituents in candidate parse of s 
# of total constituents in candidate parse of s' 


cross-brackets: the number of crossed brackets (e.g. the number 
of constituents for which the treebank has a bracketing such as 
((A B) C) but the candidate parse has a bracketing such as (A 
(B C))). 


Using a portion of the Wall Street Journal trccbank as the test 
set, parsers such as Chamiak (1997) and Collins (1999) achieve just 
under 90% recall, just under 90%; precision, and about 1% cross- 
bracketed constituents per sentence. 

For comparing parsers which use different grammars, the PAR- 
SEVAL metric includes a canonicalization algorithm for removing 
information likely to be grammar-specific (auxiliaries, pre-infinitival 
"to”, etc) and computing a simplified score. The interested reader 
should see Black et al. (1991), There are also related evaluation met¬ 
rics for dependency parses (Collins et al., 1999) and dependency- 
based metrics which work for any parse structure (Lin, 1995; Carroll 
et al., 1998). 

For grammar-checking, we can compute instead the precision 
and recall of a simpler task: how often the parser correctly rejected 
an ungrammatical sentence (or recognized a grammatical sentence). 
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used by David Hays. But this lexical dependency notion of grammar is in 
fact older than the relatively recent phrase-structure or constituency gram¬ 
mars, and has its roots in the ancient Greek and Indian linguistic traditions. 
Indeed the notion in traditional grammar of ‘parsing a sentence into subject 
and predicate’ is based on lexical relations rather than constituent relations. 



Figure 12.7 shows an example parse of the sentence I gave him my ad¬ 
dress , using the dependency grammar formalism of Jarvinen and Tapanainen 
(1997) and Karlsson el al (1995). Note that there are no non-terminal or 
phrasal nodes; each link in the parse tree holds between two lexical nodes 
(augmented with the special <root> node). The links are drawn from 
a fixed inventory of around 35 relations, most of which roughly represent 
grammatical functions or very general semantic relations. Other dependency- 
based computational grammars, such as Link Grammar (Sleator and Tem- 


perley, 1993), use different but roughly overlapping links. The following 
table shows a few of the relations used in Jarvinen and Tapanainen (1997): 

Dependency Description 

subj 

syntactic subject 

obj 

direct object (incl. sentential complements) 

dat 

indirect object 

pcomp 

complement of a preposition) 

comp 

predicate nominals (complements of copulas) 

tmp 

temporal adverbials 

loc 

location adverbials 

attr 

premodifying (attributive) nominals (genitives, etc) 

mod 

nominal postmodifiers (prepositional phrases, adjectives) 


UNK 

GRAMMAR 
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FREE WORD 
ORDER 


LINK 

GRAMMAR 


We have already discussed why dependency information is important. 
Is there any advantage to using only dependency information and ignoring 
constituency? Dependency grammar researchers argue that one of the main 
advantages of pure dependency grammars is their ability to handle languages 
with relatively free word order. For example the word order in languages 
like Czech is much more flexible than in English; an object might occur be¬ 
fore or after a location adverbial or a comp. A phrase-structure grammar 
would need a separate rule for each possible place in the parse tree that such 
an adverbial phrase could occur. A dependency grammar would just have 
one link-type representing this particular adverbial relation. Thus a depen¬ 
dency grammar abstracts away from word-order variation, representing only 
the information that is necessary for the parse. 

There are a number of computational implementations of dependency 
grammars; Link Grammar (Sleator and Temperley, 1993) and Constraint 
Grammar (Karlsson etal, 1995) are easily-available broad-coverage depen¬ 
dency grammars and parsers for English. Dependency grammars are also 
often used for other languages. Hajic (1998), for example, describes the 
500,000 word Prague Dependency Treebank for Czech which has been used 
to train probabilistic dependency parsers (Collins et al, 1999). 

Categorial Grammar 

Categorial grammars were first proposed by Adjukiewicz (1935), and mod¬ 
ified by Bar-Hillel (1953), Lambek (1958), Dowty (1979), Ades and Steed- 
man (1982), and Steedman (1989) inter alia. See Bach (1988) for an intro¬ 
duction and the other papers in Oehrle et al (1988) for a survey of extensions 
to the basic models. We will describe a simplified version of the combina¬ 
tory categorial grammar of (Steedman, 1989). A categorial grammar has 
two components. The categorial lexicon associates each word with a syn¬ 
tactic and semantic category. The combinatory rules allow functions and 
arguments to be combined. There are two types of categories: functors and 
arguments. Arguments, like nouns, have simple categories like N. Verbs 
or determiners act more like functors. For example, a determiner can be 
thought of as a function which applies to a N on its right to produce a NP. 
Such complex categories are built using the X/Y and X\Y operators. X/Y 
means a function from Y to X, i.e. something which combines with a Y 
on its right to produce an X. Determiners thus receive the category NP/N: 
something which combines with an N on its right to produce an NP. Similar, 
transitive verbs might have the category VP/NP; something which combines 
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with a NP on the right to produce a YP. Ditransitive verbs like give might 
have the category (VP/NP)/NP; something which combines with a NP on its 
right to yield a transitive verb. The simplest combination rules just combine 
an X/Y with a Y on its right to produce and X or a X\Y with a Y on its left 
to produce and X. 

Consider the simple sentence Harry eats apples from Steedman (1989). 
Instead of using a primitive VP category, let’s assume that a finite verb phrase 
like eat apples has the category (S\NP); something which combines with an 
NP on the left to produce a sentence. Harry and apples are both NPs. Eats is 
a finite transitive verb which combines with an NP on the right to produce a 
finite YP: (S\backslashNP)/NP. The derivation of S proceeds as follows: 


(12.26) 


Harry eats apples 

NP (S\NP)/NP NP 
S\NP 
S 


Modem categorial grammars include more complex combinatory rules 
which are needed for coordination and other complex phenomena, and also 
include composition of semantic categories as well as syntactic ones. See 
Chapter 15 for a discussion of semantic composition, and the above-mentioned 
references for more details about categorial grammar. 


12.5 Human Parsing 

How do people parse? Do we have evidence that people use any of the 
models of grammar and parsing developed over the last 4 chapters? Do 
people use probabilities to parse? The study of human parsing (often called 
human sentence processing) is a relatively new one, and we don’t yet have 
complete answers to these questions. But in the last 20 years we have learned 
a lot about human parsing; this section will give a brief overview of some 
recent theories. These results are relatively recent, however, and there is still 
disagreement over the correct way to model human parsing, so the reader 
should take some of this with a grain of salt. 

An important component of human parsing is ambiguity resolution. 
How can we find out how people choose between two ambiguous parses of 
a sentence? As was pointed out in this chapter and in Chapter 9, while al¬ 
most every sentence is ambiguous in some way, people rarely notice these 
ambiguities. Instead, they only seem to see one interpretation for a sentence. 


SENTENCE 

PROCESSING 
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Following a suggestion by Fodor (1978), Ford et al (1982) used this fact 
to show that the human sentence processor is sensitive to lexical subcate¬ 
gorization preferences. They presented subjects with ambiguous sentences 
like (12.27-12.28), in which the preposition phrase on the beach could at¬ 
tach either to a noun phrase ( the dogs) or a verb phrase. They asked the 
subjects to read the sentence and check off a box indicating which of the two 
interpretations they got first. The results are shown after each sentence: 

(12.27) The women kept the dogs on the beach 

• The women kept the dogs which were on the beach. 5% 

• The women kept them (the dogs) on the beach. 95% 

(12.28) The women discussed the dogs on the beach 

• The women discussed the dogs which were on the beach. 90% 

• The women discussed them (the dogs) while on the beach. 10% 

The results were that subjects preferred VP-attachment with keep and 
NP-attachment with discuss. This suggests that keep has a subcategorization 
preference for a VP with three constituents: (VP VNP PP) while discuss 
has a subcategorization preference for a VP with two constituents: (VP 
VNP ), although both verbs still allow both subcategorizations. 

Much of the more recent ambiguity-resolution research relies on a 
garden path specific class of temporarily ambiguous sentences called garden-path sen¬ 
tences. These sentences, first described by Bever (1970), are sentences 
which are cleverly constructed to have three properties which combine to 
make them very difficult for people to parse: 

1. they are temporarily ambiguous: the sentence is unambiguous, but 
its initial portion is ambiguous. 

2. one of these two parses in the initial portion is somehow preferable to 
the human parsing mechanism. 

3. but the dispreferred parse is the correct one for the sentence. 

The result of these three properties is that people are ‘led down the 
garden path’ towards the incorrect parse, and then are confused when they 
realize it’s the wrong one. Sometimes this confusion is quite conscious, as in 
Bever’s example (12.29); in fact this sentence is so hard to parse that readers 
often need to be shown the correct structure. In the correct structure raced 
is part of a reduced relative clause modifying The horse , and means ‘The 
horse [which was raced past the bam] fell’; this structure is also present in 
the sentence ‘Students taught by the Berlitz method do better when they get 
to France’. 
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(12.29) The horse raced past the bam fell. 



The horse raced past the bam fell 


The horse raced past the bam fell 


(12.30) The complex houses married and single students and their families. 


(a) 




(b) S 



NP YP 




Det Adj N Det N V 

The complex houses The complex houses 

(12.31) The student forgot the solution was in the back of the book. 



The students forgot the solution was The students forgot the solution was 


Other times the confusion caused by a garden-path sentence is so subtle 
that it can only be measured by a slight increase in reading time. For exam¬ 
ple in (12.31) from Trueswell et al (1993) (modified from an experiment by 
Ferreira and Henderson (1991)), readers often mis-parse the solution as the 
direct object of forgot rather than as the subject of an embedded sentence. 
This is another subcategorization preference difference; forgot prefers a di¬ 
rect object (VP —> VNP ) to a sentential complement (VP —> VS). But the 
difference is subtle, and is only noticeable because the subjects spent sig¬ 
nificantly more time reading the word was. How do we know how long a 
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subject takes to read a word or a phrase? One way is by scrolling a sentence 
onto a computer screen one word or phrase at a time; another is by using 
an eye-tracker to track how long their eyes linger on each word. Trueswell 
et al (1993) employed both methods in separate experiments. This ‘mini- 
garden-path’ effect at the word was suggests that subjects had chosen the 
direct object parse and had to re-analyze or rearrange their parse now that 
they realize they are in a sentential complement. By contrast, a verb which 
prefers a sentential complement (like hope) didn’t cause extra reading time 
at was. 

These garden-path sentences are not just restricted to English. (12.32) 
shows a Spanish example from Gilboy and Sopena (1996) in which the word 
que, just like English that, is ambiguous between the relative clause marker 
and the sentential complement marker. Thus up to the phrase dos hijas, 
readers assume the sentence means “the man told the woman that he had 
two daughters”; after reading the second que, they must reparse que tenia 
dos hijas as a relative clause modifier of mujer rather than a complement of 
dijo. 

(12.32) El hombre le dijo a la mujer que tenia dos hijas 

The man her told to the woman that had two daughters 

que la invitaba a cenar. 

that her he invited to dinner. 

‘The man told the woman who had two daughters that (he) would invite her 

for dinner? 

Example (12.33) shows a Japanese garden path from Mazuka and Itoh 
(1995). In this sentence, up to the verb mikaketa (saw), the reader assumes 
the sentence means “Yoko saw the child at the intersection.” But upon read¬ 
ing the word mikaketa ( taxi-DAT ), they they have to reanalyze child not as 
the object of saw , but as the object of put-on. 

(12.33) Yoko-ga kodomo-o koosaten-de mikaketa takusii-ni noseta . 

Yoko-NOM child -acc intersection-LOC saw taxi-DAT put on 

‘Yoko made the child ride the taxi she saw at the intersection.’ 

In the Spanish and Japanese examples, and in examples (12.29) and 
(12.31), the garden path is caused by the subcategorization preferences of 
the verbs. The garden-path and other methodologies have been employed to 
study many kinds of preferences besides subcategorization preferences. Ex¬ 
ample (12.31) from Jurafsky (1996) shows that sometimes these preferences 
have to do with part-of-speech preferences (for example whether houses is 
more likely to be a verb or a noun). Many of these preferences have been 
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shown to be probabilistic and to be related to the kinds of probabilities we 
have been describing in this chapter. MacDonald (1993) showed that the 
human processor is sensitive to whether a noun is more likely to be a head 
or a non-head of a constituent, and also to word-word collocation frequen¬ 
cies. Mitchell et al (1995) showed that syntactic phrase-structure frequen¬ 
cies (such as the frequency of the relative clause construction) play a role 
in human processing. Juliano and Tanenhaus (1993) showed that the hu¬ 
man processor is sensitive to a combination of lexical and phrase-structure 
frequency. 

Besides grammatical knowledge, human parsing is affected by many 
other factors which we will describe later, including resource constraints 
(such as memory limitations, to be discussed in Chapter 13), thematic struc¬ 
ture (such as whether a verb expects semantic agents or patients , to be dis¬ 
cussed in Chapter 16) and semantic, discourse, and other contextual con¬ 
straints (to be discussed in Chapter 15 and Chapter 18). While there is gen¬ 
eral agreement about the knowledge sources used by the human sentence 
processor, there is less agreement about the time course of knowledge use. 
Frazier and colleagues (most recently in Frazier and Clifton, 1996) argue 
that an initial interpretation is built using purely syntactic knowledge, and 
that semantic, thematic, and discourse knowledge only becomes available 
later. This view is often called a modularist perspective; researchers hold¬ 
ing this position generally argue that human syntactic knowledge is a distinct 
module of the human mind. Many other researchers (including MacDonald, 
1994; MacWhinney, 1987; Pearlmutter and MacDonald, 1992; Tabor et al , 
1997; Trueswell and Tanenhaus, 1994; Trueswell et al , 1994) hold an inter- 
actionist perspective, arguing that people use multiple kinds of information 
incrementally. For this latter group, human parsing is an interactive process, 
in which different knowledge sources interactively constrain the process of 
interpretation. 

Researchers such as MacDonald (1993) argue that these constraints are 
fundamentally probabilistic. For example Jurafsky (1996) and Narayanan 
and Jurafsky (1998) showed that a probabilistic model which included PCFG 
probabilities as well as syntactic and thematic subcategorization probabili¬ 
ties could account for garden-path examples such as those in (12.29-12.31) 
above. For example P(N^houses ) is greater than P(V—chouses), and this is 
one of the factors accounting for the processing difficulty of example (12.30) 
above. In the Jurafsky and Narayanan-Jurafsky model, the human language 
processor takes an input sentence, and computes the most-likely interpre¬ 
tation by relying on probabilistic sources of linguistic information. Errors 
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(such as garden-path sentences) are caused by two factors. First, the stored 
probabilities may simply not match the intended interpretation of the speaker 
(i.e. people may just rank the wrong interpretation as the best one). Second, 
people are unwilling or unable to maintain very many interpretations at one 
time. Whether because of memory limitations, or just because they have a 
strong desire to come up with a single interpretation, they prune away low- 
ranking interpretations. Jurafsky and Narayanan-Jurafsky suggest that this 
pruning happens via probabilistic beam search in the human parser (like the 
pruning described in Chapter 7). The result is that they prune away the cor¬ 
rect interpretation, leaving the highest-scoring but incorrect one. 


12.6 Summary 

This chapter has sketched the basics of probabilistic parsing, concentrat¬ 
ing on probabilistic context-free grammars and probabilistic lexicalized 
grammars. 

• Probabilistic grammars assign a probability to a sentence or string of 
words, while attempting to capture more sophisticated syntactic infor¬ 
mation than the A-gram grammars of Chapter 6, 

• A probabilistic context-free grammar (PCFG) is a context-free gram¬ 
mar in which every rule is annotated with the probability of choosing 
that rule. Each PCFG rule is treated as if it were conditionally inde¬ 
pendent; thus the probability of a sentence is computed by multiply¬ 
ing the probabilities of each rule in the parse of the sentence. 

• The Cocke-Younger-Kasami (CYK) algorithm is a bottom-up dy¬ 
namic programming parsing algorithm. Both the CYK and Earley can 
be augmented to compute the probability of a parse while they are 
parsing a sentence. 

• PCFG probabilities can be learning by counting in a parsed corpus, or 
by parsing a corpus. The Inside-Outside algorithm is a way of dealing 
with the fact that the sentences being parsed are ambiguous. 

• Probabilistic lexicalized CFGs augment PCFGs with a lexical head 
for each rule. The probability of a rule can then be conditioned on the 
lexical head or nearby heads. 

• Parsers are evaluated using three metrics: labeled recall, labeled pre¬ 
cision, and cross-brackets. 



Section 12.6. Summary 


469 


• There is evidence based on garden-path sentences and other on-line 
sentence-processing experiments that the human parser operates prob¬ 
abilistically and uses probabilistic grammatical knowledge such as sub¬ 
categorization information. 
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Bibliographical and Historical Notes 


Many of the formal properties of probabilistic context-free grammars were 
first worked out by Booth (1969) and Salomaa (1969). Baker (1979) pro¬ 
posed the Inside-Outside algorithm for unsupervised training of PCFG prob¬ 
abilities, which used a CYK-style parsing algorithm to compute inside prob¬ 
abilities. Jelinek and Lafferty (1991) extended the CYK algorithm to com¬ 
pute probabilities for prefixes. Stolcke (1995) drew on both these algorithm 
to adopt the Earley algorithm to PCFGs. 

A number of researchers starting in the early 1990’s worked on adding 
lexical dependencies to PCFGs, and on making PCFG probabilities more 
sensitive to surrounding syntactic structure. Many of these papers were first 
presented at the DARPA Speech and Natural Language Workshop in June, 
1990. A paper by Hindle and Rooth (1990) applied lexical dependencies 
to the problem of attaching preposition phrases; in the question session to 
a later paper Ken Church suggested applying this method to full parsing 
(Marcus, 1990). Early work on such probabilistic CFG parsing augmented 
with probabilistic dependency information includes Magerman and Marcus 
(1991), Black etal (1992), Jones and Eisner (1992), Bod (1993), and Jelinek 
et al (1994), in addition to Collins (1996), Chamiak (1997), and Collins 
(1999) discussed above. 

Probabilistic formulations of grammar other than PCFGs include prob¬ 
abilistic TAG grammar (Resnik, 1992; Schabes, 1992), based on the TAG 
grammars discussed in Chapter 9, probabilistic LR parsing (Briscoe and 
Carroll, 1993), and probabilistic link grammar (Lafferty et al , 1992). An 
approach to probabilistic parsing called supertagging extends the part-of- 
speech tagging metaphor to parsing by using very complex tags that are in 
fact fragments of lexicalized parse trees (Bangalore and Joshi, 1999; Joshi 
and Srinivas, 1994), based on the lexicalized TAG grammars of Schabes 
et al (1988). For example the noun purchase would have a different tag 
as the first noun in a noun compound (where it might be on the left of a 
small tree dominated by Nominal) than as the second noun (where it might 
be on the right). See Goodman (1997) and Abney (1997) for probabilis¬ 
tic treatments of feature-based grammars. Another approach combines the 
finite-state model of parsing described in Chapter 9 with the A-gram, by 
doing partial parsing and then computing A-grams over basic phrases (e.g. 
P(PP\NP )). (Moore et al, 1995; Zechner and Waibel, 1998). A number 
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of probabilistic parsers are based on dependency grammars; see for exam¬ 
ple Chelba et al (1997), Chelba and Jelinek (1998), and Berger and Printz 
(1998); these parsers were also used as language models for speech recogni¬ 
tion. 

Related to probabilistic dependency grammars is the idea of learning 
subcategorization frames for verbs, as well as probabilities for these frames. 
Algorithms which learn non-probabilistic subcategorization frames for verbs 
include the cue-based approach of Brent (1993) and the finite-state automa¬ 
ton approach of Manning (1993). Briscoe and Carroll (1997) extract more 
complex subcategorization frames (using 160 possible subcategorization la¬ 
bels) and also learns subcategorization frame frequencies, using a probabilis¬ 
tic LR parser and some post-processing. Roland and Jurafsky (1998) showed 
that it is important to compute subcategorization probabilities for the word 
sense (‘lemma’) rather than the simple orthographic word. 

Many probabilistic and corpus-based approaches have been taken to 
the preposition-phrase attachment problem since Hindle and Rooth’s study, 
including TBL (Brill and Resnik, 1994), Maximum Entropy (Ratnaparkhi 
et al , 1994), Memory-Based Learning (Jakub and Daelemans, 1997), log- 
linear models (Franz, 1997), and decision trees using semantic distance be¬ 
tween heads (computed from WordNet) (Stetina and Nagao, 1997), as well 
as the use of machine learning techniques like boosting (Abney et al , 1999). 

Manning and Schiitze (1999) is a good advanced textbook on statisti¬ 
cal natural language processing which covers probabilistic parsing. Collins’ 
(1999) dissertation includes a very readable survey of the field and introduc¬ 
tion to his parser. 


Exercises 

12.1 Implement the CYK algorithm. 

12.2 Sketch out how the CYK algorithm would have to be augmented to 
handle lexicalized probabilities. 


12.3 Implement your lexicalized extension of the CYK algorithm. 
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12.4 Implement your lexicalized extension of the CYK algorithm. 

12.5 Implement the PARSEVAL metrics described on page 460. Next ei¬ 
ther use a treebank or create your own hand-checked parsed testset. Now use 
your CFG (or other) parser and grammar and parse the testset and compute 
labeled recall, labeled precision, and cross-brackets. 

12.6 Take any three sentences from Chapter 9 and hand-parse them into 
the dependency grammar formalism of Karlsson et al (1995) shown on page 
461. 




LANGUAGE AND 
COMPLEXITY 


This is the dog , that worried the cat , r&ar killed the rat y that ate 
the malty that lay in the house that Jack built. 

Mother Goose, The house that Jack built 


This is the malt that the rat that the cat that the dog worried 
killed ate. 

Victor H. Yngve (1960) 


Much of the humor in musical comedy and comic operetta comes from 
entwining the main characters in fabulously complicated plot twists. Casilda, 
the daughter of the Duke of Plaza-Toro in Gilbert and Sullivan’s The Gon¬ 
doliers , is in love with her father’s attendent Luiz. Unfortunately, Casilda 
discovers she has already been married (by proxy) as a babe of six months to 
“the infant son and heir of His Majesty the immeasurably wealthy King of 
Barataria”. It is revealed that this infant son was spirited away by the Grand 
Inquisitor and raised by a “highly respectable gondolier” in Venice as a gon¬ 
dolier. The gondolier had a baby of the same age and could never remember 
which child was which, and so Casilda was in the unenviable position, as 
she puts it, of “being married to one of two gondoliers, but it is impossible 
to say which”. By way of consolation, the Grand Inquisitor informs her that 
“such complications frequently occur”. 

Luckily, such complications don’t frequently occur in natural language. 
Or do they? In fact there are sentences that are so complex that they are hard 
to understand, such as Yngve’s sentence above, or the sentence: 

“77/c Republicans who the senator who she voted for chastised 
were trying to cut all benefits for veterans ”. 
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Studying such sentences, and more generally understanding what level of 
complexity tends to occur in natural language, is an important area of lan¬ 
guage processing. Complexity plays an important role, for example, in de¬ 
ciding when we need to use a particular formal mechanism. Formal mecha¬ 
nisms like finite automata, Markov models, transducers, phonological rewrite 
power rules, and context-free grammars, can be described in terms of their power, 
complexity or equivalently in terms of the complexity of the phenomena that they can 
describe. This chapter introduces the Chomsky hierarchy, a theoretical tool 
that allows us to compare the expressive power or complexity of these dif¬ 
ferent formal mechanisms. With this tool in hand, we summarize arguments 
about the correct formal power of the syntax of natural languages, in particu¬ 
lar English but also including a famous Swiss dialect of German that has the 
interesting syntactic property called cross-serial dependencies. This prop¬ 
erty has been used to argue that context-free grammars are insufficiently 
powerful to model the morphology and syntax of natural language. 

In addition to using complexity as a metric for understanding the rela¬ 
tion between natural language and formal models, the field of complexity is 
also concerned with what makes individual constructions or sentences hard 
to understand. For example we saw above that certain nested or center- 
embedded sentences are difficult for people to process. Understanding what 
makes some sentences difficult for people to process is an important part of 
understanding human parsing. 

13.1 The Chomsky Hierarchy 

How are automata, context-free grammars, and phonological rewrite rules 
related? What they have in common is that each describes a formal lan¬ 
guage, which we have seen is a set of strings over a finite alphabet. But the 
kind of grammars we can write with each of these formalism are of different 
power ative generative power. One grammar is of greater generative power or complex¬ 
ity than another if it can define a language that the other cannot define. We 
will show, for example, that a context-free grammar can be used to describe 
formal languages that cannot be described with a finite state automaton. 

It is possible to construct a hierarchy of grammars, where the set of 
languages describable by grammars of greater power subsumes the set of 
languages describable by grammars of lesser power. There are many possi¬ 
ble such hierarchies; the one that is most commonly used in computational 
linguistics is the Chomsky hierarchy (Chomsky, 1959a), which includes 
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four kinds of grammars, characterized graphically in Figure 13.1. 


Type 0 Languages 


Context-Sensitive Languages 


Context-Free Languages (with no epsilon productions) 


Regular (or Right Linear) Languages 


Figure 13.1 A Venn diagram of the languages on the Chomsky Hierarchy 


What is perhaps not intuitively obvious is that the decrease in the gen¬ 
erative power of languages from the most powerful to the weakest can be 
accomplished merely by placing constraints on the way the grammar rules 
are allowed to be written. The following table shows the four types of gram¬ 
mars in the Chomsky hierarchy, defined by the constraints on the form that 
rules must take. In these examples, A is a single non-terminal, and a, P, and 
y are arbitrary strings of terminal and non-terminal symbols. They may be 
empty unless this is specifically disallowed below, x is an arbitrary string of 
terminal symbols. 


Type 

Common Name 

Rule Skeleton 

Linguistic Example 

0 

Turing Equivalent 

a -¥ P, s.t. a^e 

ATNs 

1 

Context Sensitive 

txA(3 — octP, s.t. y/e 

Tree-Adjoining Grammars 

2 

Context Free 

A-¥ Y 

Phrase Structure Grammars 

3 

Regular 

A —> xB or A —> x 

Finite State Automata 

Figure 13.2 The Chomsky Hierarchy 


Type 0 or unrestricted grammars have no restrictions on the form 
of their rules, except that the left-hand side cannot be the empty string 8. 
Any (non-null) string can be written as any other string (or as 8). Type 0 
grammars characterize the recursively enumerable languages, i.e., those 
whose strings can be listed (enumerated) by a Turing Machine. 

Context-sensitive grammars have rules that rewrite a non-terminal 


RECURSIVELY 

ENUMERABLE 


CONTEXT- 

SENSITIVE 
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CONTEXT- 

FREE 


RIGHT-LINEAR 

LEFT-LINEAR 


symbol A in the context otA(3 as any non-empty string of symbols. They can 

be either written in the form ocA(3 — y ay [3 or in the form A—± y/ a _(3. We 

have seen this latter version in the Chomsky-Halle representation of phono¬ 
logical rules (Chomsky and Halle, 1968), as the following rule of Flapping 
demonstrates: 

/t/-> [dx]/Y _Y 

While the form of these rules seems context-sensitive, Chapter 4 showed 
that phonological rule systems that do not have recursion are actually equiv¬ 
alent in power to the regular grammars. A linguistic model that is known to 
be context-sensitive is the Tree-Adjoining Grammar (Joshi, 1985). 

Another way of conceptualizing a rule in a context-sensitive grammar 
is as rewriting a string of symbols 8 as another string of symbols § in a 
“non-decreasing” way; such that (]) has at least as many symbols as 8. 

We studied context-free grammars in Chapter 9. Context-free rules 
allow any single nonterminal to be rewritten as any string of terminals and 
nonterminals. A nonterminal may also be rewritten as £, although we didn’t 
make use of this option in Chapter 9. 

Regular grammars are equivalent to regular expressions. That is, a 
given regular language can be characterized either by a regular expression 
of the type we discussed in Chapter 2, or by a regular grammar. Regular 
grammars can either be right-linear or left-linear. A rule in a right-linear 
grammar has a single non-terminal on the left, and at most one non-terminal 
on the right-hand side. If there is a non-terminal on the right-hand side, 
it must be the last symbol in the string. The right-hand-side of left-linear 
grammars is reversed (the right-hand-side must start with (at most) a single 
non-terminal). All regular languages have both a left-linear and a right-linear 
grammar. For the rest of our discussion, we will consider only the right- 
linear grammars. 

For example, consider the following regular (right-linear) grammar: 

S —y aA 
S -> bB 
A —y aS 
B -> bbS 
S —y £ 


It is regular, since the left-hand-side of each rule is a single non-terminal 
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and each right-hand side has at most one (rightmost) non-terminal. Here is a 
sample derivation in the language: 

S => aA => aaS => aabB => aabbbS =4> aabbbaA => aabbbaaS => aabbbaa 

We can see that each time S expands, it produces either aaS or bbbS; 
thus the reader should convince themself that this language corresponds to 
the regular expression ( aaUbbb )*. 

We will not present the proof that a language is regular if and only if it 
is generated by a regular language; it was first proved by Chomsky and Miller 
(1958) and can be found in textbooks like Hopcroft and Ullman (1979) and 
Lewis and Papadimitriou (1981). The intuition is that since the nonterminals 
are always at the right or left edge of a rule, they can be processed iteratively 
rather than recursively. 


13.2 HOW TO TELL IF A LANGUAGE ISN’T REGULAR 

How do we know which type of rules to use for a given problem? Could we 
use regular expressions to write a grammar for English? Our do we need to 
use context-free rules or even context-sensitive rules? It turns out that for 
formal languages there are methods for deciding this. That is, we can say for 
a given formal language whether it is representable by a regular expression, 
or whether it instead requires a context-free grammar, and so on. 

So if we want to know if some part of natural language (the phonol¬ 
ogy of English, let’s say, or perhaps the morphology of Turkish) is repre¬ 
sentable by a certain class of grammars, we need to find a formal language 
that models the relevant phenomena and figure out which class of grammars 
is appropriate for this formal language. 

Why should we care whether (say) the syntax of English is repre¬ 
sentable by a regular language? One main reason is that we’d like to know 
which type of rule to use in writing computational grammars for English. 
If English is regular, we would write regular expressions, and use efficient 
automata to process the rules. If English is context-free, we would write 
context-free rules and use the Earley algorithm to parse sentences, and so 
on. 

Another reason to care is that it tells us something about the formal 
properties of different aspects of natural language; it would be nice to know 
where a language ‘keeps’ its complexity; whether the phonological system 
of a language is simpler than the syntactic system, or whether a certain 
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PUMPING 

LEMMA 


kind of morphological system is inherently simpler than another kind. It 
would be a strong and exciting claim, for example, if we could show that the 
phonology of English was capturable by a finite-state machine rather than 
the context-sensitive rules that are traditionally used; it would mean that En¬ 
glish phonology has quite simple formal properties. Indeed, this fact was 
shown by Johnson (1972), and helped lead to the modem work in finite-state 
methods shown in Chapter 3 and Chapter 4. 

The Pumping Lemma 

The most common way to prove that a language is regular is to actually 
build a regular expression for the language. In doing this we can rely on 
the fact that the regular languages are closed under union, concatenation, 
Kleene star, complementation, and intersection. We saw examples of union, 
concatenation, and Kleene star in Chapter 2. So if we can independently 
build a regular expression for two distinct parts of a language, we can use the 
union operator to build a regular expression for the whole language, proving 
that the language is regular. 

Sometimes we want to prove that a given language is not regular. An 
extremely useful tool for doing this is the Pumping Lemma. There are two 
intuitions behind this lemma (our description of the pumping lemma draws 
from Lewis and Papadimitriou (1981) and Hopcroft and Ullman (1979)). 
First, if a language can be modeled by a finite automaton, we must be able 
to decide with a bounded amount of memory whether any string was in the 
language or not. This amount of memory can’t grow larger for different 
strings (since a given automaton has a fixed number of states). Thus the 
memory needs must not be proportional to the length of the input. This 
means for example that languages like a n b n are not likely to be regular, since 
we would need some way to remember what n was in order to make sure that 
there were an equal number of a’s and V s. The second intuition relies on the 
fact that if a regular language has any long strings (longer than the number 
of states in the automaton), there must be some sort of loop in the automaton 
for the language. We can use this fact by showing that if a language doesn’t 
have such a loop, then it can’t be regular. 

Let’s consider a language L and the corresponding deterministic FSA 
M, which has N states. Consider an input string also of length N. The 
machine starts out in state qo ; after seeing 1 symbol it will be in state q \; 
after N symbols it will be in state q n . In other words, a string of length N 
will go through N + 1 states (from qo to q^). But there are only N states 
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in the machine. This means that at least 2 of the states along the accepting 
path (call them q } and q,) must be the same. In other words, somewhere 
on an accepting path from the initial to final state, there must be a loop. 
Figure 13.3 shows an illustration of this point. Let x be the string of symbols 
that the machine reads on going from the initial state qo to the beginning of 
the loop qi. y is the string of symbols that the machine reads in going through 
the loop, z is the string of symbols from the end of the loop (qj) to the final 
accepting state (#/v). 



The machine accepts the concatenation of these three strings of sym¬ 
bols, i.e. xyz. But if the machine accepts xyz it must accept xz! This is 
because the machine could just skip the loop in processing xz. Furthermore, 
the machine could also go around the loop any number of times; thus it must 
also accept xvvz, xyyyz, xyyyyz, etc. In fact, it must accept any string of the 
form xy n z for n > 0. 

The version of the pumping lemma we give is a simplified one for 
infinite regular languages; stronger versions can be stated that also apply to 
finite languages, but this one gives the flavor of this class of lemmas: 

Pumping Lemma. Let L be an infinite regular language. Then 

there are strings x, y, and z, such that y / £ and xy n z F L for n > 0. 

The pumping lemma states that if a language is regular, then there is 
some string y that can be ‘pumped’ appropriately. But this doesn’t mean that 
if we can pump some string y, the language must be regular. Non-regular 
languages may also have strings that can be pumped. Thus the lemma is not 
used for showing that a language is regular. Rather it is used for showing 
that a language isn't regular, by showing that in some language there is no 
possible string that can be pumped in the appropriate way. 

Let’s use the pumping lemma to show that the language a n b n (i.e. the 
language consisting of strings of as followed by an equal number of bs ) is 
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not regular. We must show that any possible string s that we pick cannot be 
divided up into three parts x, y, and z such that y can be pumped. Given a 
random string s from a n b r \ we can distinguish three ways of breaking s up, 
and show that no matter which way we pick, we cannot find some y that can 
be pumped: 

1. y is composed only of as. (This implies that x is all as too, and z 
contains all the bs , perhaps preceded by some as.) But if y is all as , 
that means xy n z has more as than xyz. But this means it has more as 
than 6s, and so cannot be a member of the language a n b n ! 

2. y is composed only of 6s. The problem here is similar to case 1; If y 
is all 6s, that means xy n z has more 6s than xyz, and hence has more 6s 
than as. 

3. y is composed of both as and 6s (this implies that x is only as, while 
z is only 6s). This means that xy n z must have some 6s before as, and 
again cannot be a member of the language a n b n ! 

Thus there is no string in a n b n that can be divided into x, y, z in such a 
way that y can be pumped, and hence a n b n is not a regular language. 

But while a n b n is not a regular language, it is a context-free language. 
In fact, the context-free grammar that models a tl b n only takes two rules! 
Here they are: 

iS —y a S 6 
S e 


Here’s a sample parse tree using this grammar to derive the sentence 

aabb: 
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There is also a pumping lemma for context-free languages, that can be 
used whether or not a language is context-free; complete discussions can be 
found in Hopcroft and Ullman (1979) and Partee (1990). 

Are English and other Natural Languges Regular Languages? 

“How f s business? ” I asked. 

“Lousy and terrible .” Fritz grinned richly. “Or I pull off a new 
deal in the next month or I go as a gigolo ” 

“Either .. .or...? I corrected, from force of professional habit. 

“Fm speaking a lousy English just now?' drawled Fritz, with 
great self-satisfaction. “Sally says maybe shell give me a few 
lessons .” 

Christopher Isherwood. 1935. “Sally Bowles” from 
Goodbye to Berlin 

The pumping lemma provides us with the theoretical machinery for 
understanding the well-known arguments that English (or rather The set of 
strings of English words considered as a formal language’) is not a regular 
language. 

The first such argument was given by Chomsky (1956) and Chomsky 
(1957). He first considers the language {xx^,x e a,fo*}. x R means The re¬ 
verse of x’, so each sentence of this language consists of a string of as and bs 
followed by the reverse or ‘mirror image’ of the string. This language is not 
regular; Partee (1990) shows this by intersecting it with the regular language 
aa*bbaa*. The resulting language is a n b 2 a n ; it is left as an exercise for the 
reader (Exercise 13.3) to show that this is not regular by the pumping lemma. 

Chomsky then showed that a particular subset of the grammar of En¬ 
glish is isomorphic to the mirror image language. He has us consider the fol¬ 
lowing English syntactic structures, where 5i,52---S n , are declarative sen¬ 
tences in English: 

• If Si, then S 2 

• Either 53 , or S 4 

• The man who said S$ is arriving today 

Clearly, Chomsky points out, these are English sentences. Further¬ 
more, in each case there is a lexical dependency between one part of each 
structure and another. “If’ must be followed by “then” (and not, for example, 
“or”). “Either” must be followed by “or” (and not, for example, “because”). 
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CENTER- 

EMBEDDED 


Now these sentences can be embedded in English, one in another; for 
example, we could build sentences like the following: 

If either the man who said S 5 is arriving today or the man who 

said S 5 is arriving tomorrow, then the man who said S () is arriving 

the day after... 

The regular languages are closed under substitution or homomorphism; 
this just means that we can rename any of the symbols in the above sentences. 
Let’s introduce the following substitution: 

if —y a 

then —y a 

either —> b 

or —> b 

other words —> £ 

Now if we apply this substitution to the sentence above, we get the 
following sentence: 

abba 

This sentence has just the mirror-like property that we showed above 
was not capturable by finite-state methods. If we assume that if, then, either, 
or, can be nested indefinitely, then English is isomorphic to xx^,x e 
and hence is not a regular language. Of course, it’s not true that these struc¬ 
tures can be nested indefinitely (sentences like this get hard to understand 
after a couple nestings); we will return to this issues in Section 13.4. 

Partee (1990) gave a second proof that English is not a regular lan¬ 
guage. This proof is based on a famous class of sentences with center- 
embedded structures (Yngve, 1960); here is a variant of these sentences: 

The cat likes tuna fish. 

The cat the dog chased likes tuna fish. 

The cat the dog the rat bit chased likes tuna fish. 

The cat the dog the rat the elephant admired bit chased likes tuna fish. 

As was true with the either/or sentences above, these sentences get 
harder to understand as they get more complex. But for now, let’s assume 
that the grammar of English allows an indefinite number of embeddings. 
Then in order to show that English is not regular, we need to show that 
sentences like these are isomorphic to some non-regular language. Since 
every fronted NP must have its associated verb, these sentences are of the 
form: 
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(the + noun)" (transitive verb)" 1 likes tuna fish. 

The idea of the proof will be to show that sentences of these struc¬ 
tures can be produced by intersecting English with a regular expression. We 
will then use the pumping lemma to prove that the resulting language isn’t 
regular. 

In order to build a simple regular expression that we can intersect with 
English to produce these sentences, we define regular expressions for the 
noun groups (A) and the verbs ( B ): 

A = { the cat, the dog, the rat, the elephant, the kangaroo,...} 

B = { chased, bit, admired, ate, befriended, ... } 

Now if we take the regular expression /A* B* likes tuna fish/ 
and intersect it with English (considered as a set of strings), the resulting lan¬ 
guage is: 

L = x"y" _1 likes tuna fish, x e A,y e B 

This language L can be shown to be non-regular via the pumping 
lemma (see Exercise 13.2). Since the intersection of English with a regu¬ 
lar language is not a regular language, English cannot be a regular language 
either. 

The two arguments we have seen so far are based on English syntax. 
There are also arguments against the finite-state nature of English based on 
English morphology. These morphological arguments are a different kind 
of argument, because they don’t prove that English morphology couldn't be 
regular, only that a context-free model of English morphology is much more 
elegant and captures some useful descriptive generalizations. Let’s summa¬ 
rize one from Sproat (1993) on the prefix en-. Like other English verbs, the 
verbs formed with this prefix can take the suffix -able. So for example the 
verbs enjoy and enrich can be suffixed (< enjoyable , enrichable). But the noun 
or adjective stems themselves cannot take the -able (so * joy able, *richable). 
In other words, -able can attach if the verb-forming prefix en- has already 
attached, but not if it hasn’t. 

The reason for this is very simple; en- creates verbs, and -able only at¬ 
taches to verbs. But expressing this fact in a regular grammar has an annoy¬ 
ing and inelegant redundancy; it would have to have two paths, one through 
joy , one through enjoy , leading to different states, as follows: 

This morphological fact is easy to express in a context-free grammar; 
this is left as an exercise for the reader. 

This kind of ‘elegance’ argument against regular grammars also has 
been made for syntactic phenomena. For example a number of scholars have 
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argued that English number agreement cannot be captured by a regular (or 
even a context-free) grammar. In fact, a simple regular grammar can model 
number agreement, as Pullum and Gazdar (1982) show. They considered the 
following sentences, which have a long-distance agreement dependency: 

Which problem did your professor say she thought was unsolv- 
able? 

Which problems did your professor say she thought were unsolv- 
able? 

Here’s their regular (right-linear) grammar that models these sentences: 

S —> Which problem did your professor say T 
S —> Which problems did your professor say U 
T —> she thought T | you thought T | was unsolvable 
U she thought U | you thought U | were unsolvable 

So a regular grammar could model English agreement. The problem 
with such a grammar is not its computational power, but its elegance, as we 
saw in Chapter 9; such a regular grammar would have a huge explosion in the 
number of grammar rules. But for the purposes of computational complexity, 
agreement is not part of an argument that English is not a regular language. 
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13.3 Is Natural Language Context-Free? 

The previous section argued that English (considered as a set of strings) 
doesn’t seem like a regular language. The natural next question to ask is 
whether English is a context-free language. This question was first asked by 
Chomsky (1956), and has an interesting history; a number of well-known 
attempts to prove English and other languages non-context-free have been 
published, and all except two have been disproved after publication. One 
of these two correct (or at least not-yet disproved) arguments derives from 
the syntax of a dialect of Swiss German; the other from the morphology of 
Bambara, a Northwestern Mande language spoken in Mali and neighboring 
countries. The interested reader should see Pullum (1991, p. 131-146) for 
an extremely witty history of both the incorrect and correct proofs; this sec¬ 
tion will merely summarize one of the correct proofs, the one based on Swiss 
German. 

Both of the correct arguments, and most of the incorrect ones, make use 
of the fact that the following languages, and ones that have similar properties, 
are not context-free: 

{xx\x£{a,b}*} (13.1) 

This language consists of sentences containing two identical strings concate¬ 
nated. The following related language is also not context-free: 

a n b m c n d m (13.2) 

The non-context-free nature of such languages can be shown using the pump¬ 
ing lemma for context-free languages. 

The attempts to prove that the natural languages are not a subset of 
the context-free languages do this by showing that natural languages have a 

CROSS- 

property of these xx languages called cross-serial dependencies. In a cross- dependen 

Cies 

serial dependency, words or larger structures are related in left-to-right order 
as shown in Figure 13.6. A language that has arbitrarily long cross-serial 
dependencies can be mapped to the xx languages. 

The successful proof, independently proposed by Huybregts (1984) 
and Shieber (1985a), shows that a dialect of Swiss German spoken in Zurich 
has cross-serial constraints which make certain parts of that language equiv¬ 
alent to the non-context-free language a n b m c n d m . The intuition is that Swiss 
German allows a sentence to have a string of dative nouns followed by a 
string of accusative nouns, followed by a string of dative-taking verbs, fol¬ 
lowed by a string of accusative-taking verbs. 
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Figure 13.6 A schematic of a cross-serial dependency. 


We will follow the version of the proof presented in Shieber (1985a). 
First, he notes that Swiss German allows verbs and their arguments to be 
ordered cross-serially. Assume that all the example clauses we present below 
are preceded by the string “Jan salt das ” (“Jan says that”): 

(13.3) ...mer emHans es huus halfed aastriiche. 

... we Hans/DAT the house/ACC helped paint. 

‘...we helped Hans paint the house.’ 

Notice the cross-serial nature of the semantic dependency: both nouns 
precede both verbs, and em Hans (Hans) is the argument of halfed (helped) 
while es huus (the house) is the argument of aastriiche (paint). Furthermore, 
there is a cross-serial case dependency between the nouns and verbs; halfed 
(helped) requires the dative, and em Hans is dative, while aastriiche (paint) 
takes the accusative, and es huus (the house) is accusative. 

Shieber points out that this case marking can occur even across triply 
embedded cross-serial clauses like the following: 

(13.4) ...mer d’chind emHans es huus haend 

... we the children/ACC Hans/DAT the house/ACC have 

wele laa halfe aastriiche. 

wanted to let help paint. 

‘...we have wanted to let the children help Hans paint the house.’ 

Shieber notes that among such sentences, those with all dative NPs 
preceding all accusative NPs, and all dative-subcategorizing V’s preceding 
all accusative-subcategorizing V’s are acceptable. 

Jan sait das mer (d’chind)* (em Hans)* es huus haend wele laa* 

halfe* aastriche. 

Let’s call the regular expression above R. Since it’s a regular expres¬ 
sion (you see it only has concatenation and Kleene stars) it must define a 
regular language, and so we can intersect R with Swiss German, and if the 
result is context free, so is Swiss German. 
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But it turns out that Swiss German requires that the number of verbs 
requiring dative objects Qidlfe) must equal the number of dative NPs (em 
Hans) and similarly for accusatives. Furthermore, an arbitrary number of 
verbs can occur in a subordinate clause of this type (subject to performance 
constraints). This means that the result of intersecting this regular language 
with Swiss German is the following language: 

L = Jan sait das mer (d’chind) n (em Hans) m es huus haend wele 
(laa) w (halfe) m aastriiche. 

But this language is of the form wa n b m xc n d m y , which is not context- 

free! 

So we can conclude that Swiss German is not context free. 

13.4 Complexity and Human Processing 

We noted in passing earlier that many of the sentences that were used to 
argue for the non-finite state nature of English (like the ‘center-embedded’ 
sentences) are quite difficult to understand. If you are a speaker of Swiss 
German (or if you have a friend who is), you will notice that the long cross¬ 
serial sentences in Swiss German are also rather difficult to follow. Indeed, 
as Pullum and Gazdar (1982) point out, 

“... precisely those construction-types that figure in the various 
proofs that English is not context-free appear to cause massive 
difficulty in the human processing system... ” 

This brings us to a second use of the term complexity. In the previous 
section we talked about the complexity of a language. Here we turn to a 
question that is as much psychological as computational: the complexity of 
an individual sentence. Why are certain sentences hard to comprehend? Can 
this tell us anything about computational processes? 

Many things can make a sentence hard to understand; complicated 
meanings, extremely ambiguous sentences, the use of rare words, and bad 
handwriting are just a few. Chapter 12 introduced garden-path sentences, 
which are certainly complex, and showed that their complexity was due to 
improper choices made on temporarily ambiguous sentences by the human 
parser. But there is a another, particular, kind of complexity (often called 
‘linguistic complexity’ or ‘syntactic complexity’) that bears an interesting 
relation to the formal-language complexity from the previous section. These 
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are sentences whose complexity arises not from rare words or difficult mean¬ 
ings, but from a particular combination of syntactic structure and human 
memory limitations. Here are some examples of sentences (taken from a 
summary in Gibson (1998)) that cause difficulties when people try to read 
them (we will use the # to mean that a sentence causes extreme processing 
difficulty). In each case the (ii) example is significantly more complex than 
the (i) example: 

(13.5) (i) The cat likes tuna fish. 

(ii) #The cat the dog the rat the elephant admired bit chased likes 
tuna fish. 

(13.6) (i) If when the baby is crying, the mother gets upset, the father will 

help, so the grandmother can rest easily. 

(ii) #Because if when the baby is crying, the mother gets upset, the 
father will help, the grandmother can rest easily. 

(13.7) (i) The child damaged the pictures which were taken by the 

photographer who the professor met at the party. 

(ii) #The pictures which the photographer who the professor met at 
the party took were damaged by the child. 

(13.8) (i) The fact that the employee who the manager hired stole office 

supplies worried the executive. 

(ii) #The executive who the fact that the employee stole office 
supplies worried hired the manager. 

The earliest work on sentences of this type noticed that they all exhibit 
nesting or center-embedding (Chomsky, 1957; Yngve, 1960; Chomsky and 
Miller, 1963; Miller and Chomsky, 1963). That is, they all contain exam¬ 
ples where a syntactic category A is nested within another category B, and 
surrounded by other words (X and Y): 

[fl X U] Y] 

In each of the examples above, part (i) has zero or one embedding, 
while part (ii) has two or more embeddings. For example in (13.5ii) above, 
there are 3 reduced relative clauses embedded inside each other: 

# [s The cat |y the dog |y the rat [y the elephant admired] bit] 
chased] likes tuna fish]. 

In (13.6ii) above, the when clauses are nested inside the if clauses in¬ 
side the because clauses. 
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#[Because [if [when the baby is crying, the mother gets upset], 

the father will help], [the grandmother can rest easily]]. 

In (13.7ii), the relative clause who the professor met at the party is 
nested in between the photographer and took . The relative clause which the 
photographer ... took is then nested between The pictures and were damaged 
by the child. 

#The pictures [ which the photographer [ who the professor met 

at the party ] took ] were damaged by the child. 

Could we explain the difficulty of these nested structures just by say¬ 
ing that they are ungrammatical in English? The answer seems to be no. 
The structures that are used in the complex sentences in (13.5ii)-(13.8ii) are 
the same ones used in the easier sentences (13.5i)-(13.8i). The difference 
between the easy and complex sentences seems to hinge on the number of 
embeddings. But there is no natural way to write a grammar that allows N 
embeddings but not N+l embeddings. 

Rather, the complexity of these sentences seems to be a processing 
phenomenon; some fact about the human parsing mechanism is unable to 
deal with these kinds of multiple nestings. If complexity is a fact about 
‘parsers’ rather than grammars, we would expect sentences to be complex 
for similar reasons in other languages. That is, other languages have different 
grammars, but presumably some of the architecture of the human parser is 
shared from language to language. 

It does seems to be the case that multiply nested structures of this 
kind are also difficult in other languages. For example Japanese allows a 
singly nested clause, but an additional nesting makes a sentence unprocess¬ 
able (Cowper, 1976; Babyonyshev and Gibson, 1999). 

(13.9) Ani-ga imooto-o ijimeta. 

older-brother-NOM younger-sister-ACC bullied 

‘My older brother bullied my younger sister’ 

(13.10) Bebiisitaa-wa [[ani-ga imooto-o 

babysitter-TOP [[older-brother-NOM younger-sister -acc 

ijimeta] to] itta. 

bullied] that] said 

‘The babysitter said that my older brother bullied my younger sister’ 
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(13.11) #Obasan-wa [[Bebiisitaa-ga [[ani-ga 

aunt-TOP [[babysitter-NOM [[older-brother-NOM 
imooto-o ijimeta] to] itta] to] omotteiru. 

younger-sister -acc bullied] that] said] that] thinks 

‘#My aunt thinks that the babysitter said that my older brother bullied 
my younger sister’ 

There are a number of attempts to explain these complexity effects, 
many of which are memory-based. That is, they rely on the intuition that 
each embedding requires some memory resource to store. A sentence with 
too much embedding either uses up too many memory resources, or creates 
multiple memory traces that are confusable with each other. The result is 
that the sentence is too hard to process at all. 

For example Yngve (1960) proposed that the human parser is based on 
a limited-size stack. A stack-based parser places incomplete phrase-structure 
rules on the stack; if multiple incomplete phrases are nested, the stack will 
contain an entry for each of these incomplete rules. Yngve suggests that 
the more incomplete phrase-structure rules the parser needs to store on the 
stack, the more complex the sentence. Yngve’s intuition was that these stack 
limits might mean that English is actually a regular rather than context-free 
language, since a context-free grammar with a finite limit on its stack-size 
can be modeled by a finite automaton. 

An extension to this model (Miller and Chomsky, 1963) proposes that 
self-embedded structures are particularly difficult. A self-embedded struc¬ 
ture contains a syntactic category A nested within another example of A, and 
surrounded by other words (X and Y): 

U X U1 Y] 

Such structures might be difficult because a stack-based parser might 
confused two copies of the rule on the stack. This problem with self-embedding 
is also naturally modeled with an activation-based model, which might have 
only one copy of a particular rule. 

Although these classic parser-based explanations have intuitive appeal, 
and tie in nicely to the formal language complexity issues, it seems un¬ 
likely that they are correct. One problem with them is that there are lots 
of syntactic complexity effects that aren’t explained by these models. For 
example there are significant complexity differences between sentences that 
have the same number of embeddings, such as the well-known difference be¬ 
tween subject-extracted relative clauses ( 13.12ii) and object-extracted rela¬ 
tive clauses (13.12i): 
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(13.12) (i) [ s The reporter [ S f who \s the senator attacked ]] admitted the 

error ]. 

(ii) [s The reporter \ s > who [s attacked the senator ]] admitted the 
error ]. 

The object-extracted relative clauses are more difficult to process (mea¬ 
sured for example by the amount of time it takes to read them (Ford, 1983), 
and other factors; see for example Wanner and Maratsos (1978) and King 
and Just (1991), and Gibson (1998) for a survey). Different researchers have 
hypothesized a number of different factors that might explain this complex¬ 
ity difference. 

For example MacWhinney and colleages MacWhinney (1977, 1982), 
MacWhinney and Csaba Pleh (1988) suggest that it causes difficulty for 
reader to shift perspective from one clause participant to another. Object 
relative require two perspective shifts (from the matrix subject to the relative 
clause subject and then back) while subject relatives require none (the matrix 
subject is the same as the relative clause subject). Another potential source 
of the difficulty in the object-extraction is that the first noun (the reporter ) 
plays two different thematic roles - agent of one clause, patient of the other. 

This conflicting role-assignment may cause difficulties (Bever, 1970). 

Gibson (1998) points out that there is another important difference be¬ 
tween the object and subject extractions: the object extraction has two nouns 
that appear before any verb. The reader must hold on to these two nouns 
without knowing how they will fit into the sentences. Having multiple noun 
phrases lying around that aren’t integrated into the meaning of the sentence 
presumably causes complexity for the reader. 

Based on this observation, Gibson proposes the Syntactic Prediction 
Locality Theory (SPLT), which predicts that the syntactic memory load as- splt 
sociated with a structure is the sum of the memory loads associated with 
each of the words that are obligatorily required to complete the sentence. A 
sentence with multiple noun phrases and no verbs will require multiple verbs 
before the sentence is complete, and will thus have a high load. Memory load 
is also based on how many other new phrases or discourse referents have to 
be held in memory at the same time. Thus the memory load for a word is 
higher if there have been many intervening new discourse referents since the 
word has been predicted. Thus while a sequence of unintegrated NPs is very 
complex, a sequence in which one of the two NPs is a pronoun referring to 
someone already in the discourse is less complex. For example the follow¬ 
ing examples of doubly nested relative clauses are processable because the 
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innermost NP (/) does not introduce a new discourse entity. 

(13.13) (a) A syntax book [that some Italian [that I had never heard of ] 
wrote ] was published by MIT Press (Frank, 1992) 

(b) The pictures [ that the photographer [ who I met at the party ] 
took ] turned out very well. (Bever, personal communication to 
E. Gibson) 

In summary, the early suggestions that the complexity of human sen¬ 
tence processing is related to memory seem to be correct at some level; com¬ 
plexity in both natural and formal languages is caused by the need to keep 
many un-integrated things in memory. This is a deep and fascinating find¬ 
ing about language processing. But the relation between formal and natural 
complexity is not as simple as Yngve and others thought. Exactly which 
factors do play a role in complexity is an exciting research area that is just 
beginning to be investigated. 


13.5 Summary 

This chapter introduced two different ideas of complexity: the complexity 
of a formal language, and the complexity of a human sentence. 

• Grammars can be characterized by their generative power. One gram¬ 
mar is of greater generative power or complexity than another if it can 
define a language that the other cannot define. The Chomsky hier¬ 
archy is a hierarchy of grammars based on their generative power. It 
includes Hiring equivalent, context-sensitive, context-free, and reg¬ 
ular grammars. 

• The pumping lemma can be used to prove that a given language is not 
regular. English is not a regular language, although the kinds of sen¬ 
tences that make English non-regular are exactly those that are hard for 
people to parse. Despite many decades of attempts to prove the con¬ 
trary, English does, however, seem to be a context-free language. The 
syntax of Swiss-German and the morphology of Bambara, by contrast, 
are not context-free, and seem to require context-sensitive grammars. 

• Center-embedded sentences are hard for people to parse. Many the¬ 
ories agree that this difficulty is somehow caused by memory limita¬ 
tions of the human parser. 
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Bibliographical and Historical Notes 

Chomsky (1956) first asked whether finite-state automata or context-free 
grammars were sufficient to capture the syntax of English. His suggestion 
in that paper that English syntax contained “examples that are not easily ex¬ 
plained in terms of phrase structure” was a motivation for his development of 
syntactic transformations. Pullum (1991, p. 131-146) is the definitive histor¬ 
ical study of research on the non-context-free-ness of natural language. The 
early history of attempts to prove natural languages non-context-free is sum¬ 
marized in Pullum and Gazdar (1982). The pumping lemma was originally 
presented by Bar-Hillel et al (1961), who also offer a number of impor¬ 
tant proofs about the closure and decidability properties of finite-state and 
context-free languages. Further details, including the pumping lemma for 
context-free languages (also due to Bar-Hillel et al (1961)) can be found in 
a textbook in automata theory such as Hopcroft and Ullman (1979). 

Yngve’s idea that the difficulty of center-embedded sentences could be 
explained if the human parser was finite-state was taken up by Church (1980) 
in his master’s thesis. He showed that a finite-state parser that implements 
this idea could also explain a number of other grammatical and psycholin¬ 
guists phenomena. While the field has turned toward more sophisticated 
models of complexity, Church’s work can be seen as the beginning of the 
return to finite-state models that characterized the 1980’s and 1990’s. 

There are a number of other ways of looking at complexity that we 
didn’t have space to go into here. One is whether language processing is 
NP-complete. NP-complete is the name of a class of problems which are complete 
suspected to be particularly difficult to process Barton et al (1987) prove a 
number of complexity results about the NP-completeness of natural language 
recognition and parsing. Among other things, they showed that 

1. maintaining lexical and agreement feature ambiguities over a poten¬ 
tially infinite-length sentence causes the problem of recognizing sen¬ 
tences in some unification-based formalisms like Lexical-Functional 
Grammar to be NP-complete. 

2. Two-level morphological parsing (or even just mapping between lexi¬ 
cal and surface form) is also NP-complete. 

Recent work has also begun to link processing complexity with information- 
theoretic measures like Kolmogorov complexity (Juola, 1999). 
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Exercises 

13.1 Is the language a"b 2 a" context-free? 

13.2 Use the pumping lemma to show this language is not regular: 

L = x n y n ~ 1 likes tuna fish, x € A, y e B 

13.3 Partee (1990) showed that the language xx R ,x E a,b* is not regular, by 
intersecting it with the regular language aa*bbaa*. The resulting language 
is a n b 2 a n . Use the pumping lemma to show that this language is not regular, 
completing the proof that xx^ ,x E a , b* is not regular. 

13.4 Build a context-free grammar for the language 
L= {xx^|x £ a,b*} 

13.5 Using a context-free grammar to represent the English morphological 
facts described in Figure 13.5. Assume that en- applies to a particular class 
of adjectives (call it Adj' 35 ) and nouns (call it Noun\(fi. 



Semantics is the study of the meaning of linguistic utterances. For 
our purposes, this amounts to the study of formal representations that 
are capable of capturing the meanings of linguistic utterances, and 
the study of algorithms that are capable of mapping from linguistic 
utterances to appropriate meaning representations. As we will see, the 
most important topic to be addressed in this study is how the meaning 
of an utterance is related to the meanings of the phrases, words, and 
morphemes that make it up. Following tradition, issues related to 
speakers and hearers, and the context in which utterances are found, 
will be deferred to Part IV, which takes up the topic of Pragmatics. 

This part of the book begins by exploring ways to represent the 
meaning of utterances, focusing on the use of First Order Predicate 
Calculus. It next explores various theoretical and practical approaches 
to compositional semantic analysis, as well as its use in practical prob¬ 
lems such as question answering and information extraction. It next 
turns to the topic of the meanings of individual words, the role of 
meaning in the organization of a lexicon, and algorithms for word- 
sense disambiguation. Finally, it covers the topic of information re¬ 
trieval, an application area of great importance that operates almost 
entirely on the basis of individual word meanings. 






REPRESENTING 

MEANING 


Ishmael: Surely all this is not without meaning . 
Herman Melville, Moby Dick 


The approach to semantics that is introduced here, and is elaborated 
on in the next four chapters, is based on the notion that the meaning of lin¬ 
guistic utterances can be captured in formal structures, which we will call 
meaning representations. Correspondingly, the frameworks that are used 
to specify the syntax and semantics of these representations will be called 
meaning representation languages. These meaning representations play 
a role analogous to that of the phonological, morphological, and syntactic 
representations introduced in earlier chapters. 

The need for these representations arises when neither the raw linguis¬ 
tic inputs, nor any of the structures derivable from them by any of the trans¬ 
ducers we have studied, facilitate the kind of semantic processing that is de¬ 
sired. More specifically, what is needed are representations that can bridge 
the gap from linguistic inputs to the kind of non-linguistic knowledge needed 
to perform a variety of tasks involving the meaning of linguistic inputs. 

To illustrate this idea, consider the following everyday language tasks 
that require some form of semantic processing. 


MEANING 

REPRESENTA¬ 
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MEANING 

REPRESENTA¬ 
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• Answering an essay question on an exam. 

• Deciding what to order at a restaurant by reading a menu. 

• Learning to use a new piece of software by reading the manual. 

• Realizing that you’ve been insulted. 

• Following a recipe. 




498 


Chapter 14. Representing Meaning 


It should be clear that simply having access to the kind of phonological, mor¬ 
phological, and syntactic representations we have discussed thus far will not 
get us very far on accomplishing any of these tasks. These tasks require ac¬ 
cess to representations that link the linguistic elements involved in the task to 
the non-linguistic knowledge of the world needed to successfully accomplish 
them. For example, some of the knowledge of the world needed to perform 
the above tasks includes: 

• Answering and grading essay questions requires background knowl¬ 
edge about the topic of the question, the desired knowledge level of 
the students, and how such questions are normally answered. 

• Reading a menu and deciding what to order, giving advice about where 
to go to dinner, following a recipe, and generating new recipes all re¬ 
quire deep knowledge about food, its preparation, what people like to 
eat and what restaurants are like. 

• Learning to use a piece of software by reading a manual, or giving ad¬ 
vice about how to do the same, requires deep knowledge about current 
computers, the specific software in question, similar software applica¬ 
tions, and knowledge about users in general. 

In the representational approach being explored here, we take linguis¬ 
tic inputs and construct meaning representations that are made up of the 
same kind of stuff that is used to represent this kind of everyday common- 
sense knowledge of the world. The process whereby such representations 
are created and assigned to linguistic inputs is called semantic analysis. 

To make this notion more concrete, consider Figure 14.1, which shows 
sample meaning representations for the sentence I have a car using four 
frequently used meaning representation languages. The first row illustrates a 
sentence in First Order Predicate Calculus, which will be covered in detail in 
Section 14.3; the graph in the center illustrates a Semantic Network , which 
will be discussed further in Section 14.5; the third row contains a Conceptual 
Dependency diagram, discussed in more detail in Chapter 16, and finally a 
frame-based representation, also covered in Section 14.5. 

While there are a number of significant differences among these four 
approaches to representation, at an abstract level they all share as a common 
foundation the notion that a meaning representation consists of structures 
composed from a set of symbols. When appropriately arranged, these sym¬ 
bol structures are taken to correspond to objects, and relations among ob¬ 
jects, in some world being represented. In this case, all four representations 
make use of symbols corresponding to the speaker, a car, and a number of 
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Bx,yHaving(x) A Haver{Speaker,x) A HadThing(y,x) A Car(y) 

r“' 


Haver 

Had-Thing 

Speaker 

Car 

Car 

Having 

ft POSS-BY 

Haver: Speaker 

Speaker 

HadThing: Car 

Figure 14.1 A list of symbols, two directed graphs, and a record structure: 
a sampler of meaning representations for I have a car. 


relations denoting the possession of one by the other. 

It is important to note that these representations can be viewed from at 
least two distinct perspectives in all four of these approaches: as represen¬ 
tations of the meaning of the particular linguistic input I have a car , and as 
representations of the state of affairs in some world. It is this dual perspec¬ 
tive that allows these representations to be used to link linguistic inputs to 
the world and to our knowledge of it. 

The structure of this part of the book parallels that of the previous parts. 
We will alternate discussions of the nature of meaning representations with 
discussions of the computational processes that can produce them. More 
specifically, this chapter introduces the basics of what is needed in a mean¬ 
ing representation, while Chapter 15 introduces a number of techniques for 
assigning meanings to linguistic inputs. Chapter 16 explores a range of com¬ 
plex representational issues related to the meanings of words. Chapter 17 
then explores some robust computational methods designed to exploit these 
lexical representations. 

Note that since the emphasis of this chapter is on the basic require¬ 
ments of meaning representations, we will defer a number of extremely im¬ 
portant issues to later chapters. In particular, the focus of this chapter is on 
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representing what is sometimes called the literal meaning of sentences. By 
this, we have in mind representations that are closely tied to the conventional 
meanings of the words that are used to create them, and that do not reflect 
the context in which they occur. The shortcomings of such representations 
with respect to phenomena such as idioms and metaphor will be discussed 
in the next two chapters, while the role of context in ascertaining the deeper 
meaning of sentences will be covered in Chapters 18 and 19. 

There are three major parts to this chapter. Section 14.1 explores some 
of the practical computational requirements for what is needed in a meaning 
representation language. Section 14.2 then discusses some of the ways that 
language is structured to convey meaning. Section 14.3 then provides an 
introduction to First Order Predicate Calculus, which has historically been 
the principal technique used to investigate semantic issues. 


14.1 Computational Desiderata for Representations 

We begin by considering the issue of why meaning representations are needed 
and what they should do for us. To focus this discussion, we will consider in 
more detail the task of giving advice about restaurants to tourists. In this dis¬ 
cussion, we will assume that we have a computer system that accepts spoken 
language queries from tourists and construct appropriate responses by using 
a knowledge base of relevant domain knowledge. A series of examples will 
serve to introduce some of the basic requirements that a meaning represen¬ 
tation must fulfill, and some of the complications that inevitably arise in the 
process of designing such meaning representations. In each of these exam¬ 
ples, we will examine the role that the representation of the meaning of the 
request must play in the process of satisfying it. 

Verifiability 

Let us begin by considering the following simple question. 

(14.1) Does Maharani serve vegetarian food? 

This example illustrates the most basic requirement for a meaning represen¬ 
tation: it must be possible to use the representation to determine the relation¬ 
ship between the meaning of a sentence and the world as we know it. In other 
words, we need to be able to determine the truth of our representations. The 
most straightforward way to implement this notion is make it possible for a 
system to compare, or match , the representation of the meaning of an input 
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against the representations in its knowledge base, its store of information 
about its world. 

In this example, let us assume that the meaning of this question con¬ 
tains, as a component, the meaning underlying the proposition Maharani 
serves vegetarian food. For now, we will simply gloss this representation as: 

Serves (Maharani , VegetarianFood) 

It is this representation of the input that will be matched against the 
knowledge base of facts about a set of restaurants. If the system finds a 
representation matching the input proposition in its knowledge base, it can 
return an affirmative answer. Otherwise, it must either say No , if its knowl¬ 
edge of local restaurants is complete, or say that it does not know if there is 
reason to believe that its knowledge is incomplete. 

This notion is known as verifiability, and concerns a system’s ability 
to compare the state of affairs described by a representation to the state of 
affairs in some world as modeled in a knowledge base. 1 

Unambiguous Representations 

The domain of semantics, like all the other domains we have studied, is 
subject to ambiguity. Specifically, single linguistic inputs can legitimately 
have different meaning representations assigned to them based on the cir¬ 
cumstances in which they occur. 

Consider the following example from the berp corpus. 

(14.2) I wanna eat someplace that’s close to icsi. 

Given the allowable argument structures for the verb eat , this sentence can 
either mean that the speaker wants to eat at some nearby location, or under 
a Godzilla as speaker interpretation, the speaker may want to devour some 
nearby location. The answer generated by the system for this request will 
depend on which interpretation is chosen as the correct one. 

Since ambiguities such as this abound in all genres of all languages, 
some means of determining that certain interpretations are preferable (or 
alternatively less preferable) than others is needed. The various linguistic 
phenomenon that give rise to such ambiguities, and the techniques that can 
be employed to deal with them, will be discussed in detail in the next four 
chapters. 

L This is a fairly practical characterization of verifiability. More theoretical views of this 
notion are briefly covered in Section 14.6. 
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Our concern in this chapter, however, is with the status of our meaning 
representations with respect to ambiguity, and not with how we arrive at 
correct interpretations. Since we reason about, and act upon, the semantic 
content of linguistic inputs, the final representation of an input’s meaning 
should be free from any ambiguity. Therefore, regardless of any ambiguity 
in the raw input, it is critical that a meaning representation language support 
representations that have a single unambiguous interpretation. 2 
vagueness A concept closely related to ambiguity is vagueness. Like ambiguity, 

vagueness can make it difficult to determine what to do with a particular 
input based on its meaning representation. Vagueness, however, does not 
give rise to multiple representations. 

Consider the following request as an example. 

(14.3) I want to eat Italian food. 

While the use of the phrase Italian food may provide enough information for 
a restaurant advisor to provide reasonable recommendations, it is neverthe¬ 
less quite vague as to what the user really wants to eat. Therefore, a vague 
representation of the meaning of this phrase may be appropriate for some 
purposes, while a more specific representation may be needed for other pur¬ 
poses. It will, therefore, be advantageous for a meaning representation lan¬ 
guage to support representations that maintain a certain level of vagueness. 
Note that it is not always easy to distinguish ambiguity from vagueness. 
Zwicky and Sadock (1975) provide a useful set of tests that can be used as 
diagnostics. 

Canonical Form 

The notion that single sentences can be assigned multiple meanings leads to 
the related phenomenon of distinct inputs that should be assigned the same 
meaning representation. Consider the following alternative ways of express¬ 
ing Example 14.1. 

(14.4) Does Maharani have vegetarian dishes? 

(14.5) Do they have vegetarian food at Maharani? 

(14.6) Are vegetarian dishes served at Maharani? 

(14.7) Does Maharani serve vegetarian fare? 

2 This does not foreclose the use of intermediate semantic representations that maintain 
some level of ambiguity on the way to a single unambiguous form. Examples of such repre¬ 
sentations will be discussed in Chapter 15. 
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Given that these alternatives use different words and have widely vary¬ 
ing syntactic analyses, it would not be unreasonable to expect them to have 
substantially different meaning representations. Such a situation would, 
however, have undesirable consequences for our matching approach to de¬ 
termining the truth of our representations. If the system’s knowledge base 
contains only a single representation of the fact in question, then the rep¬ 
resentations underlying all but one of our alternatives will fail to produce a 
match. We could, of course, store all possible alternative representations of 
the same fact in the knowledge base, but this would lead to an enormous 
number of problems related to keeping such a knowledge base consistent. 

The way out of this dilemma is motivated by the fact that since the an¬ 
swers given for each of these alternatives should be the same in all situations, 
we might say that they all mean the same thing, at least for the purposes of 
giving restaurant recommendations. In other words, at least in this domain, 
we can legitimately consider assigning the same meaning representation to 
the propositions underlying each of these requests. Taking such an approach 
would guarantee that our matching scheme for answering Yes-No questions 
will still work. 

The notion that inputs that mean the same thing should have the same 
meaning representation is known as the doctrine of canonical form. This 
approach greatly simplifies various reasoning tasks since systems need only 
deal with a single meaning representation for a potentially wide range of 
expressions. 

Canonical form does, of course, complicate the task of semantic anal¬ 
ysis. To see this, note that the alternatives given above use completely dif¬ 
ferent words and syntax to refer to vegetarian fare and to what restaurants do 
with it. More specifically, to assign the same representation to all of these 
requests our system will have to conclude that vegetarian fare , vegetarian 
dishes and vegetarian food refer to the same thing in this context, that the 
use here of having and serving are similarly equivalent, and that the differ¬ 
ent syntactic parses underlying these requests are all compatible with the 
same meaning representation. 

Being able to assign the same representation to such diverse inputs is 
a tall order. Fortunately there are some systematic meaning relationships 
among word senses and among grammatical constructions that can be ex¬ 
ploited to make this task tractable. Consider the issue of the meanings of 
the words food , dish and fare in these examples. A little introspection, or a 
glance at a dictionary, reveals that these words have a fair number of distinct 
uses. Fortunately, it also reveals that there is at least one sense that is shared 
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among them all. If a system has the ability to choose that shared sense, then 
an identical meaning representation can be assigned to the phrases contain¬ 
ing these words. 

senses In general, we say that these words all have various word senses and 

that some of the senses are synonymous with one another. The process of 
disamb^gua e choosing the right sense in context is called word sense disambiguation, 
or word sense tagging by analogy to part-of-speech tagging. The topics of 
synonymy, sense tagging, and a host of other topics related to word meanings 
will be covered in Chapters 16 and 17. Suffice it to say here that the fact that 
inputs may use different words does not preclude the assignment of identical 
meanings to them. 

Just as there are systematic relationships among the meanings of dif¬ 
ferent words, there are similar relationships related to the role that syntactic 
analyses play in assigning meanings to sentences. Specifically, alternative 
syntactic analyses often have meanings that are, if not identical, at least sys¬ 
tematically related to one another. Consider the following pair of examples. 

(14.8) Maharani serves vegetarian dishes. 

(14.9) Vegetarian dishes are served by Maharani. 

Despite the different placement of the arguments to serve in these examples, 
we can still assign Maharani and vegetarian dishes to the same roles in both 
of these examples because of our knowledge of the relationship between ac¬ 
tive and passive sentence constructions. In particular, we can use knowledge 
of where grammatical subjects and direct objects appear in these construc¬ 
tions to assign Maharani , to the role of the server, and vegetarian dishes to 
the role of thing being served in both of these examples, despite the fact that 
they appear in different surface locations. The precise role of the grammar in 
the construction of meaning representations will be covered in Chapter 15. 

Inference and Variables 

Continuing with the topic of the computational purposes that meaning rep¬ 
resentations should serve, we should consider more complex requests such 
as the following. 

(14.10) Can vegetarians eat at Maharani? 

Here, it would be a mistake to invoke canonical form to force our system to 
assign the same representation to this request as for the previous examples. 
The fact that this request results in the same answer as the others arises not 
because they mean the same thing, but because there is a commonsense con- 
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nection between what vegetarians eat and what vegetarian restaurants serve. 

This is a fact about the world and not a fact about any particular kind of 
linguistic regularity. This implies that no approach based on canonical form 
and simple matching will give us an appropriate answer to this request. What 
is needed is a systematic way to connect the meaning representation of this 
request with the facts about the world as they are represented in a knowledge 
base. 

We will use the term inference to refer generically to a system’s abil- inference 
ity to draw valid conclusions based on the meaning representation of inputs 
and its store of background knowledge. It must be possible for the system 
to draw conclusions about the truth of propositions that are not explicitly 
represented in the knowledge base, but are nevertheless logically derivable 
from the propositions that are present. 

Now consider the following somewhat more complex request. 

(14.11) I’d like to find a restaurant where I can get vegetarian food. 

Unlike our previous examples, this request does not make reference to any 
particular restaurant. The user is stating that they would like information 
about an unknown and unnamed entity that is a restaurant that serves veg¬ 
etarian food. Since this request does not mention any particular restaurant, 
the kind of simple matching-based approach we have been advocating is not 
going to work. Rather, answering this request requires a more complex kind 
of matching that involves the use of variables. We can gloss a representation 
containing such variables as follows. 

Serves(x, VegetarianFood ) 

Matching such a proposition succeeds only if the variable x can be re¬ 
placed by some known object in the knowledge base in such a way that the 
entire proposition will then match. The concept that is substituted for the 
variable can then be used to fulfill the user’s request. Of course, this simple 
example only hints at the issues involved in the use of such variables. Suffice 
it to say that linguistic inputs contain many instances of all kinds of indef¬ 
inite references and it is therefore critical for any meaning representation 
language to be able to handle this kind of expression. 

Expressiveness 

Finally, to be useful a meaning representation scheme must be expressive 
enough to handle an extremely wide range of subject matter. The ideal sit¬ 
uation, of course, would be to have a single meaning representation lan- 
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guage that could adequately represent the meaning of any sensible natural 
language utterance. Although this is probably too much to expect from any 
single representational system, Section 14.3 will show that First Order Pred¬ 
icate Calculus is expressive enough to handle quite a lot of what needs to be 
represented. 


14.2 Meaning Structure of Language 


MEANING 

STRUCTURE 

OF 
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The previous section focused on some of the purposes that meaning rep¬ 
resentations must serve, without saying much about what we will call the 
meaning structure of language. By this, we have in mind the various meth¬ 
ods by which human languages convey meaning. These include a variety of 
conventional form-meaning associations, word-order regularities, tense sys¬ 
tems, conjunctions and quantifiers, and a fundamental predicate-argument 
structure. The remainder of this section focuses exclusively on this last no¬ 
tion of a predicate-argument structure, which is the mechanism that has had 
the greatest practical influence on the nature of meaning representation lan¬ 
guages. The remaining topics will be addressed in Chapter 15 where the 
primary focus will be on how they contribute to how meaning representa¬ 
tions are assembled, rather than on the nature of the representations. 


Predicate-Argument Structure 

It appears to be the case that all human languages have a form of predicate- 
argument arrangement at the core of their semantic structure. To a first ap¬ 
proximation, this predicate-argument structure asserts that specific relation¬ 
ships hold among the various concepts underlying the constituent words and 
phrases that make up sentences. It is largely this underlying structure that 
permits the creation of a single composite meaning representation from the 
meanings of the various parts of an input. One of the most important jobs 
of a grammar is to help organize this predicate-argument structure. Corre¬ 
spondingly, it is critical that our meaning representation languages support 
the predicate-argument structures presented to us by language. 

We have already seen the beginnings of this concept in our discus¬ 
sion of verb complements in Chapter 9 and Chapter 11. There we saw that 
verbs dictate specific constraints on the number, grammatical category, and 
location of the phrases that are expected to accompany them in syntactic 
structures. To briefly review this idea, consider the following examples. 
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(14.12) I want Italian food. 

(14.13) I want to spend less than five dollars. 

(14.14) I want it to be close by here. 

These examples can be classified as having one of the following three syn¬ 
tactic argument frames. 

NP want NP 
NP want Inf-VP 
NP want NPInf-VP 

These syntactic frames specify the number, position and syntactic cat¬ 
egory of the arguments that are expected to accompany a verb. For example, 
the frame for the variety of want that appears in Example 14.12 specifies the 
following facts: 

• There are two arguments to this predicate. 

• Both arguments must be NPs. 

• The first argument is pre-verbal and plays the role of the subject. 

• The second argument is post-verbal and plays the role of the direct 
object. 

As we have shown in previous chapters, this kind of information is quite 
valuable in capturing a variety of important facts about syntax. By analyzing 
easily observable semantic information associated with these frames, we can 
also gain considerable insight into our meaning representations. We will 
begin by considering two extensions of these frames into the semantic realm: 
semantic roles and semantic restrictions on these roles. 

The notion of a semantic role can be understood by looking at the sim¬ 
ilarities among the arguments in Examples 14.12 through 14.14. In each of 
these cases, the pre-verbal argument always plays the role of the entity do¬ 
ing the wanting, while the post-verbal argument plays the role of the concept 
that is wanted. By noticing these regularities and labeling them accordingly, 
we can associate the surface arguments of a verb with a set of discrete roles 
in its underlying semantics. More generally, we can say that verb subcatego¬ 
rization frames allow the linking of arguments in the surface structure with 
the semantic roles these arguments play in the underlying semantic repre¬ 
sentation of an input. The study of roles associated with specific verbs and 
across classes of verbs is usually referred to as thematic role or case role 
analysis and will be studied in more detail in Section 14.4 and Chapter 16. 

The notion of semantic restrictions arises directly from these semantic 
roles. Returning to Examples 14.12 through 14.14, we can see that it is not 
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merely the case that each initial noun phrase argument will be the wanter 
but that only certain kinds, or categories , of concepts can play the role of 
wanter in any straightforward manner. Specifically, want restricts the con¬ 
stituents appearing as the first argument to those whose underlying concepts 
can actually partake in a wanting. Traditionally, this notion is referred to as 
restriction a selection restriction. Through the use of these selection restrictions, verbs 
can specify semantic restrictions on their arguments. 

Before leaving this topic, we should note that verbs are by no means 
the only objects in a grammar that can carry a predicate-argument structure. 
Consider the following phrases from the berp corpus. 

(14.15) an Italian restaurant under fifteen dollars 

In this example, the meaning representation associated with the preposition 
under can be seen as having something like the following structure. 

U nder(ItalianRestaurant, $15) 

In other words, prepositions can be characterized as two-argument predicates 
where the first argument is an object that is being placed in some relation to 
the second argument. 

Another non-verb based predicate-argument structure is illustrated in 
the following example. 

(14.16) make a reservation for this evening for a table for two persons at 8. 

Here, the predicate-argument structure is based on the concept under¬ 
lying the noun reservation , rather than make , the main verb in the phrase. 
This example gives rise to a four argument predicate structure like the fol¬ 
lowing. 

Reservation(Hearer ; Today , 8/W. 2) 

This discussion makes it clear that any useful meaning representation 
language must be organized in a way that supports the specification of se¬ 
mantic predicate-argument structures. Specifically, this support must include 
support for the kind of semantic information that languages present: 

• Variable arity predicate-argument structures. 

• The semantic labeling of arguments to predicates. 

• The statement of semantic constraints on the fillers of argument roles. 
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14.3 First Order Predicate Calculus 

First Order Predicate Calculus (fopc) is a flexible, well-understood, and 
computationally tractable approach to the representation of knowledge that 
satisfies many of the requirements raised in Sections 14.1 and 14.2 for a 
meaning representation language. Specifically, it provides a sound computa¬ 
tional basis for the verifiability, inference, and expressiveness requirements. 
However, the most attractive feature of fopc is the fact that it makes very 
few specific commitments as to how things ought to be represented. As we 
will see, the specific commitments it does make are ones that are fairly easy 
to live with; the represented world consists of objects, properties of objects, 
and relations among objects. 

The remainder of this section first provides an introduction to the basic 
syntax and semantics of fopc and then describes the application of fopc 
to a number of linguistically relevant topics. Section 14.6 then discusses 
the connections between fopc and some of the other representations shown 
earlier in Figure 14.1. 

Elements of FOPC 

We will explore fopc in a bottom-up fashion by first examining its 
various atomic elements and then showing how they can be composed to 
create larger meaning representations. Figure 14.2, which provides a com¬ 
plete context-free grammar for the particular syntax of fopc that we will be 
using, will be our roadmap for this section. 

Let’s begin by examining the notion of a Term, the fopc device for 
representing objects. As can be seen from Figure 14.2, FOPC provides three 
ways to represent these basic building blocks: constants, functions, and vari¬ 
ables. Each of these devices can be thought of as a way of naming, or point¬ 
ing to, an object in the world under consideration. 

Constants in fopc refer to specific objects in the world being de¬ 
scribed. Such constants are conventionally depicted as either single capi¬ 
talized letters such as A and B or single capitalized words that are often rem¬ 
iniscent of proper nouns such as Maharani and Harry. Like programming 
language constants, fopc constants refer to exactly one object. Objects can, 
however, have multiple constants that refer to them. 

Functions in fopc correspond to concepts that which are often ex¬ 
pressed in English as genitives such as the location of Maharani or Maha- 
ranVs location. A fopc translation of such an expression might look like 
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Formula —> AtomicFormula 

| Formula Connective Formula 
| Quantifier Variable ,... Formula 
| -i Formula 
| ( Formula ) 

AtomicFormula —> Predicate (Term ,...) 


Term -A Function(Term ,...) 
| Constant 
I Variable 


Connective 

Quantifier 

Constant 

Variable 

Predicate 

Function 


—^ A | V | 

^ V| 3 

—> A | VegetarianFood \ Maharani ••• 
-a *| y | **• 

-A Serves \ Near | • • • 

LocationOf \ CuisineOf | • 


Figure 14.2 A context-free grammar specification of the syntax of First 
Order Predicate Calculus representations. (Adapted from Russell and Norvig 
(1995).) 


the following. 

LocationOf {Maharani) 

fopc functions are syntactically the same as single argument predicates. It 
is important to remember, however, that while they have the appearance of 
predicates they are in fact Terms in that they refer to unique objects. Func¬ 
tions provide a convenient way to refer to specific objects without having 
to associate a named constant with them. This is particularly convenient in 
cases where many named objects, like restaurants, will have a unique con¬ 
cept such as a location associated with them. 
variable The notion of a variable is our final fopc mechanism for referring to 
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objects. Variables, which are normally depicted as single lower-case letters, 
give us the ability to make assertions and draw inferences about objects with¬ 
out having to make reference to any particular named object. This ability to 
make statements about anonymous objects comes in two flavors: making 
statements about a particular unknown object and making statements about 
all the objects in some arbitrary world of objects. We will return to the topic 
of variables after we have presented quantifiers, the elements of fopc that 
will make them useful. 

Now that we have the means to refer to objects, we can move on to the 
fopc mechanisms that are used to state relations that hold among objects. 

As one might guess from its name, fopc is organized around the notion of 
the predicate. Predicates are symbols that refer to, or name, the relations that 
hold among some fixed number of objects in a given domain. Returning to 
the example introduced informally in Section 14.1, a reasonable fopc repre¬ 
sentation for Maharani serves vegetarian food might look like the following 
formula. 

Serves(Maharani , VegetarianFood ) 

This fopc sentence asserts that Serves, a two-place predicate, holds between 
the objects denoted by the constants Maharani and VegetarianFood. 

A somewhat different use of predicates is illustrated by the following 
typical representation for a sentence like Maharani is a restaurant. 

Restaurant (Maharani) 

This is an example of a one-place predicate that is used, not to relate multiple 
objects, but rather to assert a property of a single object. In this case, it 
encodes the category membership of Maharani. We should note that while 
this is a commonplace way to deal with categories it is probably not the 
most useful. Section 14.4 will return to the topic of the representation of 
categories. 

With the ability to refer to objects, to assert facts about objects, and 
to relate objects to one another, we have the ability to create rudimentary 
composite representations. These representations correspond to the atomic 
formula level in Figure 14.2. Recall that this ability to create composite 
meaning representations was one of the core components of the meaning 
structure of language described in Section 14.2. 

This ability to compose complex representations is not limited to the 
use of single predicates. Larger composite representations can also be put 
together through the use of logical connectives. As can be seen from Figure connec- 
14.2, logical connectives give us the ability to create larger representations 
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by conjoining logical formulas using one of three operators. Consider, for 
example, the following berp sentence and one possible representation for it. 

(14.17) I only have five dollars and I don’t have a lot of time. 
Have{Speaker,FiveDollars) A -> Have (Speaker, Lot O/Time) 

The semantic representation for this example is built up in a straightforward 
way from semantics of the individual clauses through the use of the A and 
-> operators. Note that the recursive nature of the grammar in Figure 14.2 
allows an infinite number of logical formulas to be created through the use 
of these connectives. Thus as with syntax, we have the ability to create an 
infinite number of representations using a finite device. 

The Semantics of FOPC 

The various objects, properties, and relations represented in a fopc knowl¬ 
edge base acquire their meanings by virtue of their correspondence to ob¬ 
jects, properties, and relations out in the external world being modeled by 
the knowledge base, fopc sentences can, therefore, be assigned a value of 
True or False based on whether the propositions they encode are in accord 
with the world or not. 

Consider the following example. 

(14.18) Ay Caramba is near ICSI. 

Capturing the meaning of this example in fopc involves identifying the 
Terms and Predicates that correspond to the various grammatical elements 
in the sentence, and creating logical formulas that capture the relations im¬ 
plied by the words and syntax of the sentence. For this example, such an 
effort might yield something like the following. 

Near(LocationOf(AyCaramba ), LocationOf(ICSI )) 

The meaning of this logical formula then arises from the relationship 
between the terms LocationOf(AyCaramba ), Localion()f(IGS1 ), the predi¬ 
cate Near , and the objects and relation they correspond to in the world being 
modeled. Specifically, this sentence can be assigned a value of True or False 
based on whether or not the real Ay Caramba is actually close to ICSI or not. 
Of course, since our computers rarely have direct access to the outside world 
we have to rely on some other means to determine the truth of formulas like 
this one. 

For our current purposes, we will adopt what is known as a database 
semantics for determining the truth of our logical formulas. Operationally, 
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atomic formulas are taken to be true if they are literally present in the knowl¬ 
edge base or if they can be inferred from other formula that are in the knowl¬ 
edge base. The interpretations of formulas involving logical connectives is 
based on the meaning of the components in the formulas combined with the 
meanings of the connectives they contain. Figure 14.3 gives interpretations 
for each of the logical operators shown in Figure 14.2. 


p 

Q 

-.p 

PAQ 

PVQ 

P^Q 

False 

False 

True 

False 

False 

True 

False 

True 

True 

False 

True 

True 

True 

False 

False 

False 

True 

False 

True 

True 

False 

True 

True 

True 

Figure 14.3 Truth table giving the semantics of the various logical 
connectives. 


The semantics of the A (and), and -> (not) operators are fairly straight¬ 
forward, and are correlated with at least some of the senses of their corre¬ 
sponding English terms. However, it is worth pointing out that the V (or) 
operator is not disjunctive in the same way that the corresponding English 
word is, and that the => (implies) operator is only loosely based on any 
commonsense notions of implication or causation. As we will see in more 
detail in Section 14.4, in most cases it is safest to rely directly on the en¬ 
tries in the truth table, rather than on intuitions arising from the names of the 
operators. 

Variables and Quantifiers 

We now have all the machinery necessary to return to our earlier discussion 
of variables. As noted above, variables are used in two ways in fopc: to re¬ 
fer to particular anonymous objects and to refer generically to all objects in 
a collection. These two uses are made possible through the use of operators 
known as quantifiers. The two operators that are basic to fopc are the ex- quantifiers 
istential quantifier, which is denoted 3, and is pronounced as “there exists”, 
and the universal quantifier, which is denoted V, and is pronounced as “for 
all”. 

The need for an existentially quantified variable is often signaled by 
the presence of an indefinite noun phrase in English. Consider the following 
example. 

(14.19) a restaurant that serves Mexican food near ICSI. 
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Here reference is being made to an anonymous object of a specified category 
with particular properties. The following would be a reasonable representa¬ 
tion of the meaning of such a phrase. 

3xRestaurant(x) 

AServes(x, MexicanFood ) 

A Near( (Location O f(x ), Location O f(ICSI )) 

The existential quantifier at the head of this sentence instructs us on 
how to interpret the variable x in the context of this sentence. Informally, it 
says that for this sentence to be true there must be at least one object such 
that if we were to substitute it for the variable x, the resulting sentence would 
be true. For example, if AyCaramba is a Mexican restaurant near ICSI, then 
substituting AyCaramba for x results in the following logical formula. 

Restaurant (AyCaramba) 

AServes (AyCaramba, MexicanFood) 

AN ear ((Location Of (AyCaramba ), Local ion Of (ICSI )) 

Based on the semantics of the A operator, this sentence will be true if 
all of its three component atomic formulas are true. These in turn will be true 
if they are either present in the system’s knowledge base or can be inferred 
from other facts in the knowledge base. 

The use of the universal quantifier also has an interpretation based on 
substitution of known objects for variables. The substitution semantics for 
the universal quantifier takes the expression for all quite literally; the V oper¬ 
ator states that for the logical formula in question to be true the substitution 
of any object in the knowledge base for the universally quantified variable 
should result in a true formula. This is in marked contrast to the 3 operator 
which only insists on a single valid substitution for the sentence to be true. 

Consider the following example. 

(14.20) All vegetarian restaurants serve vegetarian food. 

A reasonable representation for this sentence would be something like the 
following. 

\/xVegetarianRestaurant(x) => Se rves (x, Veget a ri an Food) 

For this sentence to be true, it must be the case that every substitution of a 
known object for x must result in a sentence that is true. We can divide up the 
set of all possible substitutions into the set of objects consisting of vegetarian 
restaurants and the set consisting of everything else. Let us first consider the 
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case where the substituted object actually is a vegetarian restaurant; one such 
substitution would result in the following sentence. 

VegetarianRestaurant(Maharani) 

=> Serves(Maharani , VegetarianFood) 

If we assume that we know that the consequent clause, 

Serves (Maharani , VegetarianFood) 

is true then this sentence as a whole must be true. Both the antecedent and 
the consequent have the value True and, therefore, according to the first 
two rows of Table 14.3 the sentence itself can have the value True. This 
result will, of course, be the same for all possible substitutions of Terms 
representing vegetarian restaurants for x. 

Remember, however, that for this sentence to be true it must be true 
for all possible substitutions. What happens when we consider a substitu¬ 
tion from the set of a objects that are not vegetarian restaurants? Consider 
the substitution of a non-vegetarian restaurant such as Ay Caramba’s for the 
variable x. 

VegetarianRestaurant(AyCaramba) 

=> Serves (AyCaramba , VegetarianFood) 

Since the antecedent of the implication is False , we can determine 
from Table 14.3 that the sentence is always True , again satisfying the V 
constraint. 

Note, that it may still be the case that Ay Caramba serves vegetarian 
food without actually being a vegetarian restaurant. Note also, that despite 
our choice of examples, there are no implied categorical restrictions on the 
objects that can be substituted for x by this kind of reasoning. In other words, 
there is no restriction of x to restaurants or concepts related to them. Con¬ 
sider the following substitution. 

VegetarianRestaurant(Carburetor) 

=> Serves(Carburetor. VegetarianFood) 

Here the antecedent is still false and hence the rule remains true under this 
kind of irrelevant substitution. 

To review, variables in logical formulas must be either existentially (3) 
or universally (V) quantified. To satisfy an existentially quantified variable, 
there must be at least one substitution that results in a true sentence. Sen¬ 
tences with universally quantified variables must be true under all possible 
substitutions. 
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Inference 


MODUS 

PONENS 


FORWARD 

CHAINING 


One of the most important desiderata given in Section 14.1 for a meaning 
representation language is that it should support inference — the ability to 
add valid new propositions to a knowledge base, or to determine the truth 
of propositions not explicitly contained within a knowledge base. This sec¬ 
tion briefly discusses modus ponens, the most important inference method 
provided by fopc. Applications of modus ponens will be discussed in Chap¬ 
ter 18. 

Modus ponens is a familiar form of inference that corresponds to what 
is informally known as if-then reasoning. We can abstractly define modus 
ponens as follows, where a and P should be taken as fopc formulas. 

a 

a => p 

— r - 

In general, schemas like this indicate that the formula below the line can 
be inferred from the formulas above the line by some form of inference. 
Modus ponens simply states that if the left-hand side of an implication rule 
is present in the knowledge base, then the right-hand side of the rule can be 
inferred. In the following discussions, we will refer to the left hand side of 
an implication as the antecedent, and the right-hand side as the consequent. 

As an example of a typical use of modus ponens, consider the follow¬ 
ing example, which uses a rule from the last section. 

(14.21) 

VegetarianRestaurant (Rudys) 

VxVegetarianRestaurant{x ) Serves(x. VegetarianFood) 

Serves(Rudys, VegetarianFood ) 

Here, the formula VegetarianRestaurant {Rudys) matches the antecedent 
of the rule, thus allowing us to use modus ponens to conclude 
Serves(Rudys, VegetarianFood). 

Modus ponens is typically put to practical use in one of two ways: for¬ 
ward chaining and backward chaining. In forward chaining systems, modus 
ponens is used in precisely the manner just described. As individual facts are 
added to the knowledge base, modus ponens is used to fire all applicable im¬ 
plication rules. In this kind of arrangement, as soon as a new fact is added to 
the knowledge base, all applicable implication rules are found and applied, 
each resulting in the addition new facts to the knowledge base. These new 
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propositions in turn can be used to fire implication rules applicable to them. 

The process continues until no further facts can be deduced. 

The forward chaining approach has the advantage that facts will be 
present in the knowledge base when needed, since in a sense all inference 
is performed in advance. This can substantially reduce the time needed to 
answer subsequent queries since they should all amount to simple lookups. 

The disadvantage of this approach is that facts may be inferred and stored 
that will never be needed. Production systems, which are heavily used syste U m C s I0N 
in cognitive modeling work, are forward chaining inference systems aug¬ 
mented with additional control knowledge that governs which rules are to be 
fired. 

In backward chaining, modus ponens is run in reverse to prove spe- chaining 0 
cific propositions, called queries. The first step is to see if the query formula 
is true by determining if it is present in the knowledge base. If it is not, 
then the next step is to search for applicable implication rules present in the 
knowledge base. An applicable rule is one where the consequent of the rule 
matches the query formula. If there are such any such rules, then the query 
can be proved if the antecedent of any one them can be shown to be true. 

Not surprisingly, this can be performed recursively by backward chaining 
on the antecedent as a new query. The Prolog programming language is a 
backward chaining system that implements this strategy. 

To see how this works, let’s assume that we have been asked to verify 
the truth of the proposition Serves{Rudys,VegetarianFood ), assuming the 
facts given above the line in 14.21. Since it is not present in the knowledge 
base, a search for an applicable rule is initiated that results in the rule given 
above. After substituting, the constant Rudys for the variable x, our next task 
is to prove the antecedent of the rule, VegetarianRestaurant(Rudys), which 
of course is one of the facts we are given. 

Note that it is critical to distinguish between reasoning via backward 
chaining from queries to known facts, and reasoning backwards from known 
consequents to unknown antecedents. To be specific, by reasoning back¬ 
wards we mean that if the consequent of a rule is known to be true, we as¬ 
sume that the antecedent will be as well. For example, let’s assume that we 
know that Serves(Rudys,VegetarianFood ) is true. Since this fact matches 
the consequent of our rule, we might reason backwards to the conclusion 
that VegetarianRestaurant (Rudys ). 

While backward chaining is a sound method of reasoning, reasoning 
backwards is an invalid, though frequently useful, form of plausible rea¬ 
soning . Plausible reasoning from consequents to antecedents is known as 
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14.4 


abduction, and as we will see in Chapter 18 is often useful in accounting for 
many of the inferences people make while analyzing extended discourses. 

While forward and backward reasoning are sound, neither is complete. 
This means that there are valid inferences that can not be found by sys¬ 
tems using these methods alone. Fortunately, there is an alternative infer¬ 
ence technique called resolution that is sound and complete. Unfortunately, 
inference systems based on resolution are far more computationally expen¬ 
sive than forward or backward chaining systems. In practice, therefore, most 
systems use some form of chaining, and place a burden on knowledge base 
developers to encode the knowledge in a fashion that permits the necessary 
inferences to be drawn. 

Some Linguistically Relevant Concepts 

Entire lives have been spent studying the representation of various aspects 
of human knowledge. These efforts have ranged from tightly focused ef¬ 
forts to represent individual domains such as time, to monumental efforts to 
encode all of our commonsense knowledge of the world (Lenat and Guha, 
1991). Our focus here is considerably more modest. This section provides a 
brief overview of the representation of a few important topics that have clear 
implications for language processing. Specifically, the following sections 
provide introductions to the meaning representations of categories, events, 
time, and beliefs. 

Categories 

As we noted in Section 14.2, words with predicate-like semantics often ex¬ 
press preferences for the semantics of their arguments in the form of selec¬ 
tion restrictions. These restrictions are typically expressed in the form of 
semantically-based categories where all the members of a category share a 
set of relevant features. 

The most common way to represent categories is to create a unary 
predicate for each category of interest. Such predicates can then be asserted 
for each member of that category. For example, in our restaurant discussions 
we have been using the unary predicate VegetarianRestaurant as in: 

VegetarianRestaurant (. Maharani ) 

Similar logical formulas would be included in our knowledge base for 
each known vegetarian restaurant. 
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Unfortunately, in this method categories are relations, rather than full- 
fledged objects. It is, therefore, difficult to make assertions about categories 
themselves, rather than about their individual members. For example, we 
might want to designate the most popular member of a given category as in 
the following expression. 

MostPo pular(Maharani, VegetarianRestaurant) 

Unfortunately, this is not a legal fopc formula since the arguments to pred¬ 
icates in fopc must be Terms , not other predicates. 

One way to solve this problem is to represent all the concepts that 
we want to make statements about as full-fledged objects via a tech¬ 
nique called reification. In this case, we can represent the category of reification 
VegetarianRestaurant as an object just as Maharani is. The notion of mem¬ 
bership in such a category is then denoted via a membership relation as in 
the following. 

ISA (Maharani, VegetarianRestaurant) 

The relation denoted by ISA (is a) holds between objects and the cate¬ 
gories in which they are members. This technique can be extended to create 
hierarchies of categories through the use of other similar relations, as in the 
following. 

AKO(VegetarianRestaurant, Restaurant) 

Here, the relation AKO (a kind of) holds between categories and denotes 
a category inclusion relationship. Of course, to truly give these predicates 
meaning they would have to be situated in a larger set of facts defining cate¬ 
gories as sets. 

Chapter 16 discusses the practical use of such relations in databases of 
lexical relations, in the representation of selection restrictions, and in word 
sense disambiguation. 

Events 

The representations for events that we have used until now have consisted of 
single predicates with as many arguments as are needed to incorporate all the 
roles associated with a given example. For example, the representation for 
making a reservation discussed in Section 14.2 consisted of a single pred¬ 
icate with arguments for the person making the reservation, the restaurant, 
the day, the time, and the number of people in the party, as in the following. 

Reservation(Hearer ; Maharani, Today, 8 PM, 2) 
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In the case of verbs, this approach simply assumes that the predicate rep¬ 
resenting the meaning of a verb has the same number of arguments as are 
present in the verb’s syntactic subcategorization frame. 

Unfortunately, there are three problems with this approach that make 
it awkward to apply in practice: 

• Determining the correct number of roles for any given event. 

• Representing facts about the roles associated with an event. 

• Ensuring that all the correct inferences can be derived directly from the 
representation of an event. 

• Ensuring that no incorrect inferences can be derived from the represen¬ 
tation of an event. 

We will explore these, and other related issues, by considering a series 
of representations for events. This discussion will focus on the following 
examples of the verb eat. 

(14.22) late. 

(14.23) I ate a turkey sandwich. 

(14.24) I ate a turkey sandwich at my desk. 

(14.25) I ate at my desk. 

(14.26) I ate lunch. 

(14.27) I ate a turkey sandwich for lunch. 

(14.28) I ate a turkey sandwich for lunch at my desk. 

Clearly, the variable number of arguments for a predicate-bearing verb 
like eat poses a tricky problem. While we would like to think that all of these 
arity examples denote the same kind of event, predicates in fopc have fixed arity 
— they take a fixed number of arguments. 

One possible solution is suggested by the way that examples like these 
are handled syntactically. The solution given in Chapter 11 was to create 
one subcategorization frame for each of the configurations of arguments that 
a verb allows. The semantic analog to this approach is to create as many 
different eating predicates as are needed to handle all of the ways that eat 
behaves. Such an approach would yield the following kinds of representa- 
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tions for Examples 14.22 through 14.22. 

Eatingi {Speaker) 

Eating 2 ( Speaker , T urkeySandwich) 

Eating 3 ( Speaker , T urkeySandwich , Desk) 

E ating 4 (S peake r. Desk) 

Eatings ( Speaker , Lunch) 

Eatings ( Speaker , LurkeySandwich , Lunch) 

Eating 7 ( Speaker , T urkeySandwich , Lunch , Desk) 

This approach simply sidesteps the issue of how many arguments the 
Eating predicate should have by creating distinct predicates for each of the 
subcategorization frames. Unfortunately, this approach comes at a rather 
high cost. Other than the suggestive names of the predicates, there is noth¬ 
ing to tie these events to one another even though there are obvious logical 
relations among them. Specifically, if Example 14.28 is true then all of the 
other examples are true as well. Similarly, if Example 14.27 is true then 
Examples 14.22, 14.23 and 14.26 must also be true. Such logical connec¬ 
tions can not be made on the basis of these predicates alone. Moreover, we 
would expect a commonsense knowledge base to contain logical connections 
between concepts like Eating and related concepts like Hunger and Food. 

One method to solve these problems involves the use of what are called 
meaning postulates. Consider the following example postulate. 

Vw, x, y, z Eating 7 (w, x, y, z) => Eatings (w, x, y) 

This postulate explicitly ties together the semantics of two of our predicates. 
Other postulates could be created to handle the rest of the logical relations 
among the various Eatings and the connections from them to other related 
concepts. 

Although such an approach might be made to work in small domains, 
it clearly has scalability problems. A somewhat more sensible approach is to 
say that Examples 14.22 through 14.28 all reference the same predicate with 
some of the arguments missing from some of the surface forms. Under this 
approach, as many arguments are included in the definition of the predicate 
as ever appear with it in an input. Adopting the structure of a predicate 
like Eatingi as an example would give us a predicate with four arguments 
denoting the eater, thing eaten, meal being eaten and the location of the 
eating. The following formulas would then capture the semantics of our 
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examples. 

3w,x,y Eating(Speaker,w,x, y) 

3w,x Eating(Speaker,Turkey Sandwich, w,x) 

3w Eating{Speaker , TurkeySandwich, iv, ZMvfc) 
x Eating(Speaker, iv, x, 
x mg(S peake r , iv, Lunch, x) 

3w Eating{Speaker ; TurkeySandwich, Lunch, w) 

Eating(Speaker ; 7" urkeySandwich, Lunch, Dc.v/c) 

This approach directly yields the obvious logical connections among 
these formulas without the use of meaning postulates. Specifically, all of the 
sentences with ground terms as arguments logically imply the truth of the 
formulas with existentially bound variables as arguments. 

Unfortunately, this approach still has at least two glaring deficiencies: 
it makes too many commitments, and it does not let us individuate events. 
As an example of how it makes too many commitments, consider how we 
accommodated the for lunch complement in Examples 14.26 through 14.28; 
a third argument, the meal being eaten, was added to the Eating predicate. 
The presence of this argument implicitly makes it the case that all eating 
events are associated with a meal (ie. breakfast, lunch, or dinner). More 
specifically, the existentially quantified variable for the meal argument in the 
above examples states that there is some formal meal associated with each 
of these eatings. This is clearly silly since one can certainly eat something 
independent of it being associated with a meal. 

To see how this approach fails to properly individuate events, consider 
the following formulas. 

3w, x Eating(Speaker, w, x, Desk) 

3w,x Eating(S peake r, w, Lunch, x) 

3w, x Eating(Speaker, w, Lunch, Desk) 

If we knew that the first two formula were referring to the same event, they 
could be combined to create the third representation. Unfortunately, with 
the current representation we have no way of telling if this is possible. The 
independent facts that / ate at my desk and / ate lunch do not permit us to 
conclude that / ate lunch at my desk. Clearly what is lacking is some way of 
referring to the events in question. 

As with categories, we can solve these problems if we employ reifica¬ 
tion to elevate events to objects that can be quantified and related to a other 
objects via sets of defined relations (Davidson, 1967; Parsons, 1990). Con- 
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sider the representation of Example 14.23 under this kind of approach. 

3w I SA(w. Eating) 

AEater(w , Speaker) A Eaten(w , Turkey Sandwich) 

This representation states that there is an eating event where the Speaker 
is doing the eating and a Turkey Sandwich is being eaten. The meaning rep¬ 
resentations for Examples 14.22 and 14.27 can be constructed similarly. 

3w ISA(w, Eating) AEater(w, Speaker) 

3w ISA(w, Eating) 

AEater(w , Speaker) A Eaten(w , Turkey Sandwich) 

AMealEaten(w , Lunch) 

Under this reified-event approach: 

• There is no need to specify a fixed number of arguments for a given 
surface predicate, rather as many roles and fillers can be glued on as 
appear in the input. 

• No more roles are postulated than are mentioned in the input. 

• The logical connections among closely related examples is satisfied 
without the need for meaning postulates. 

Representing Time 

In the preceding discussion of events, we did not address the issue of repre¬ 
senting the time when the represented events are supposed to have occurred. 
The representation of such information in a useful form is the domain of 
temporal logic. This discussion will serve to introduce the most basic con¬ 
cerns of temporal logic along with a brief discussion of the means by which 
human languages convey temporal information, which among other things 
includes tense logic, the ways that verb tenses convey temporal information. 

The most straightforward theory of time hold that it flows inexorably 
forward, and that events are associated with either points or intervals in time, 
as on a timeline. Given these notions, an ordering can be imposed on distinct 
events by situating them on the timeline. More specifically, we can say that 
one event precedes another, if the flow of time leads from the first event to 
the second. Accompanying these notions in most theories is the idea of the 
current moment in time. Combining this notion with the idea of a temporal 
ordering relationship yields the familiar notions of past, present and future. 

Not surprisingly, there are a large number of schemes for representing 
this kind of temporal information. The one presented here is a fairly simple 
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one that stays within the fopc framework of reified events that we have been 
pursuing. Consider the following examples. 

(14.29) I arrived in New York. 

(14.30) I am arriving in New York. 

(14.31) I will arrive in New York. 

These sentences all refer to the same kind of event and differ solely in the 
tense of the verb. In our current scheme for representing events, all three 
would share the following kind of representation, which lacks any temporal 
information. 

3 w ISA(w, Arriving) 

AArriver(w. Speaker) A Dest inat ion(w. NewYork) 

The temporal information provided by the tense of the verbs can be 
exploited by predicating additional information about the event variable w. 
Specifically, we can add temporal variables representing the interval corre¬ 
sponding to the event, the end point of the event, and temporal predicates 
relating this end point to the current time as indicated by the tense of the 
verb. Such an approach yields the following representations for our arriving 
examples. 

3/, e , w, 1I SA(w, Arriving) 

AArriver(w, Speaker) A Destination(w,NewYork) 
IntervalOf(w , i) A EndPoint(i , e) A Precedes(e,Now) 

3/, e , w, 11SA(w, Arriving) 

AArriver(w. Speaker) A Destination(w,NewYork) 
IntervalOf{w , /) AMemberOf(i,Now) 

3/, e , w, 11SA(w, Arriving) 

AArriver(w, Speaker) A Dest inat ion(w, NewYork) 
IntervalOf(w , /) A EndPoint(i , e) A Precedes(Now , e) 

This representation introduces a variable to stand for the interval of time as¬ 
sociated with the event, and a variable that stands for the end of that interval. 
The two-place predicate Precedes represents the notion that the first time 
point argument precedes the second in time; the constant Now refers to the 
current time. For past events, the end point of the interval must precede the 
current time. Similarly, for future events the current time must precede the 
end of the event. For events happening in the present, the current time is 
contained within the event interval. 
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Unfortunately, the relation between simple verb tenses and points in 
time is by no means straightforward. Consider the following examples. 

(14.32) Ok, we fly from San Francisco to Boston at 10. 

(14.33) Flight 1390 will be at the gate an hour now. 

In the first example, the present tense of the verb fly is used to refer to a 
future event, while in the second the future tense is used to refer to a past 
event. 

More complications occur when we consider some of the other verb 
tenses. Consider the following examples. 

(14.34) Flight 1902 arrived late. 

(14.35) Flight 1902 had arrived late. 

Although both refer to events in the past, representing them in the same way 
seems wrong. The second example seems to have another unnamed event 
lurking in the background (eg. Flight 1902 had already arrived late when 
something else happened). To account for this phenomena, Reichenbach 
(1947) introduced the notion of a reference point. In our simple temporal 
scheme, the current moment in time is equated with the time of the utterance, 
and is used as a reference point for when the event occurred (before, at, 
or after). In Reichenbach’s approach, the notion of the reference point is 
separated out from the utterance time and the event time. The following 
examples illustrate the basics of this approach. 

(14.36) When Mary’s flight departed, I ate lunch. 

(14.37) When Mary’s flight departed, I had eaten lunch. 

In both of these examples, the eating event has happened in the past, 
ie. prior to the utterance. However, the verb tense in the first example indi¬ 
cates that the eating event began when the flight departed, while the second 
example indicates that the eating was accomplished prior to the flight’s de¬ 
parture. Therefore, in Reichenbach’s terms the departure event specifies the 
reference point. These facts can be accommodated by asserting additional 
constraints relating the eating and departure events. In the first example, the 
reference point precedes the eating event, and in the second example, the 
eating precedes the reference point. Figure 14.4 illustrates Reichenbach’s 
approach with the primary English tenses. Exercise 14.9 asks you to repre¬ 
sent these examples in FOPC. 

This discussion has focused narrowly on the broad notions of past, 
present, and future and how they are signaled by verb tenses. Of course, 


REFERENCE 

POINT 



526 


Chapter 14. Representing Meaning 


Past Perfect 

Simple Past 

Present Perfect 

I had eaten. 

I ate. 

I have eaten. 

1 1 

E R 

t*' i r* 

U R,E U 

1 1 * 

E R,U 

Present 

Simple Future 

Future Perfect 

I eat. 

I will eat. 

I will have eaten. 

1 

U,R,E 

i 

U,R E 

"7 ; i 

u E R 

Figure 14.4 Reichenbach’s approach applied to various English tenses. In 
these diagrams, time flows from left to right, an E denotes the time of the event, 
an R denotes the reference time, and an U denotes the time of the utterance. 


languages also have many other more direct and more specific ways to con¬ 
vey temporal information, including the use of a wide variety of temporal 
expressions as in the following ATIS examples. 

(14.38) I’d like to go at 6:45, in the morning. 

(14.39) Somewhere around noon, please. 

(14.40) Later in the afternoon, near 6pm. 

As we will see in the next chapter, grammars for such temporal expressions 
are of considerable practical importance in information extraction and ques¬ 
tion answering applications. 

Finally, we should note that there is a systematic conceptual organiza¬ 
tion reflected in examples like these. In particular, temporal expressions in 
English are frequently expressed in spatial terms, as is illustrated by the var¬ 
ious uses of at , in, somewhere and near in these examples (Lakoff and John¬ 
son, 1980; Jackendoff, 1983a). Metaphorical organizations such as these, 
where one domain is systematically expressed in terms of another, will be 
discussed in more detail in Chapter 16. 

Aspect 

In the last section, we discussed ways to represent the time of an event with 
respect to the time of an utterance describing it. In this section, we address 
aspect the notion of aspect, which concerns a cluster of related topics, including 
whether an event has ended or is ongoing, whether it is conceptualized as 
happening at a point in time or over some interval, and whether or not any 
particular state in the world comes about because of it. Based on these and 
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related notions, event expressions have traditionally been divided into four 
general classes: statives, activities, accomplishments, and achievements. 

The following examples provide prototypical instances of each class. 

Stative: I know my departure gate. 

Activity: John is flying. 

Accomplishment: Sally booked her flight. 

Achievement: She found her gate. 

Although the earliest versions of this classification were discussed by Aristo¬ 
tle, the one presented here is due to Yendler (1967). In the following discus¬ 
sion, we’ll present a brief characterization of each of the four classes, along 
with some diagnostic techniques suggested in Dowty (1979) for identifying 
examples of each kind. 

Stative expressions represent the notion of an event participant having stative 
a particular property, or being in a state, at a given point in time. As such, 
they can be thought of as capturing an aspect of a world at a single point in 
time. Consider the following ATIS examples. 

(14.41) I like Flight 840 arriving at 10:06. 

(14.42) I need the cheapest fare. 

(14.43) I have a round trip ticket for $662. 

(14.44) I want to go first class. 

In examples like these, the event participant denoted by the subject can be 
seen as experiencing something at a specific point in time. Whether or not 
the experiencer was in the same state earlier, or will be in the future is left 
unspecified. 

There are a number of diagnostic tests for identifying statives. As an 
example, stative verbs are distinctly odd when used in the progressive form. 

(14.45) *1 am needing the cheapest fare on this day. 

(14.46) *1 am wanting to go first class. 

We should note that in these and subsequent examples, we are using an * to 
indicate a broadened notion of ill-formedness that may include both semantic 
and syntactic factors. 

Statives are are also odd when used as imperatives. 

(14.47) *Need the cheapest fare! 

Finally, statives are not easily modified by adverbs like deliberately 
and carefully . 
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(14.48) *1 deliberately like Flight 840 arriving at 10:06. 

(14.49) *1 carefully like Flight 840 arriving at 10:06. 

Activity expressions describe events undertaken by a participant that 
have no particular end-point. Unlike statives, activities are seen as occurring 
over some span of time, and are therefore not associated with single points 
in time. Consider the following examples. 

(14.50) She drove a Mazda. 

(14.51) I live in Brooklyn. 

These examples both specify that the subject is engaged in, or has engaged 
in, the activity specified by the verb for some period of time. 

Unlike statives, activity expressions are fine in both the progressive and 
imperative forms. 

(14.52) She is living in Brooklyn. 

(14.53) Drive a Mazda! 

However, like statives, activity expressions are odd when temporally 
modified with temporal expressions using in. 

(14.54) *1 live in Brooklyn in a month. 

(14.55) *She drove a Mazda in an hour. 

They can, however, successfully be used with for temporal adverbials, as in 
the following examples. 

(14.56) I live in Brooklyn for a month. 

(14.57) She drove a Mazda for an hour. 

Unlike activities, accomplishment expressions describe events that 
have a natural end-point and result in a particular state. Consider the fol¬ 
lowing examples. 

(14.58) He booked me a reservation. 

(14.59) United flew me to New York. 

In these examples, there is an event that is seen as occurring over some period 
of time that ends when the intended state is accomplished. 

A number of diagnostics can be used to distinguish accomplishment 
events from activities. Consider the following examples, which make use of 
the word stop as a test. 

(14.60) I stopped living in Brooklyn. 

(14.61) She stopped booking my flight. 
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In the first example, which is an activity, one can safely conclude that the 
statement I lived in Brooklyn even though this activity came to an end. How¬ 
ever, from the second example, one can not conclude the statement She 
booked her flight, since the activity was stopped before the intended state was 
accomplished. Therefore, although stopping an activity entails that the ac¬ 
tivity took place, stopping an accomplishment event indicates that the event 
did not succeed. 

Activities and accomplishments can also be distinguished by by how 
they can be modified by various temporal adverbials. Consider the following 
examples. 

(14.62) *1 lived in Brooklyn in a year. 

(14.63) She booked a flight in a minute. 

In general, accomplishments can be modified by in temporal expressions, 
while simple activities can not. 

The final aspectual class, achievements, are similar to accomplish- ments e 
ments in that they result in a state. Consider the following examples. 

(14.64) She found her gate. 

(14.65) I reached New York. 

Unlike accomplishments, achievement events are thought of as happening in 
an instant, and are not equated with any particular activity leading up to the 
state. To be more specific, the events in these examples may have been pre¬ 
ceded by extended searching or traveling events, but the events correspond¬ 
ing directly to found and reach are conceived of as points not intervals. 

The point-like nature of these events has implications for how they can 
be temporally modified. In particular, consider the following examples. 

(14.66) I lived in New York for a year. 

(14.67) *1 reached New York for a few minutes. 

Unlike activity and accomplishment expressions, achievements can not be 
modified by for adverbials. 

Achievements can also be distinguished from accomplishments by em¬ 
ploying the word stop , as we did earlier. Consider the following examples. 

(14.68) I stopped booking my flight. 

(14.69) *1 stopped reaching New York. 

As we saw earlier, using stop with an accomplishment expression results 
in a failure to reach the intended state. Note, however, that the resulting 
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expression is perfectly well-formed. On the other hand, using stop with an 
achievement example is unacceptable. 

We should note that since both accomplishments and achievements are 
events that result in a state, they are sometimes characterized as sub-types of 
a single aspectual class. Members of this combined class are known as telic 
tuauties n ' eventualities. 

Before moving on, we should make two points about this classification 
scheme. The first point is that event expressions can easily be shifted from 
one class to another. Consider the following examples. 

(14.70) I flew. 

(14.71) I flew to New York. 

The first example is a simple activity; it has no natural end-point and can 
not be temporally modified by in temporal expressions. On the other hand, 
the second example is clearly an accomplishment event since it has an end¬ 
point, results in a particular state, and can be temporally modified in all the 
ways that accomplishments can. Clearly the classification of an event is not 
solely governed by the verb, but by the semantics of the entire expression in 
context. 

The second point is that while classifications such as this one are often 
useful, they do not explain why it is that events expressed in natural lan¬ 
guages fall into these particular classes. We will revisit this issue in Chap¬ 
ter 16 where we will sketch a representational approach due to Dowty (1979) 
that accounts for these classes. 

Representing Beliefs 

There are a fair number of words and expressions that have what might be 
called a world creating ability. By this, we mean that their meaning repre¬ 
sentations contain logical formulas that are not intended to taken as true in 
the real world, but rather as part of some kind of hypothetical world. In addi¬ 
tion, these meaning representations often denote a relation from the speaker, 
or some other entity, to this hypothetical world. Examples of words that 
have this ability are believe , want , imagine and know. World-creating words 
generally take various sentence-like constituents as arguments. 

Consider the following example. 

(14.72) I believe that Mary ate British food. 

Applying our event-oriented approach we would say that there two events 
underlying this sentence: a believing event relating the speaker to some spe- 
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cific belief, and an eating event that plays the role of the believed thing. 
Ignoring temporal information, a straightforward application of our reified 
event approach would produce the following kind of representation. 

3m, v I SA(u, Believing) A ISA(v, Eating) 

ABeliever(u , Speaker) A BelievedProp(u , v) 

A Eater(v,Mary) A Eaten(v , BritishFood) 

This seems relatively straightforward, all the right roles are present and 
the two events are tied together in a reasonable way. Recall, however, that 
in conjunctive representations like this all of the individual conjuncts must 
be taken to be true. In this case, this results in a statement that there actually 
was an eating of British food by Mary. Specifically, by breaking this for¬ 
mula apart into separate formulas by conjunction elimination the following 
formula can be produced. 

3 v ISA(y, Eating) 

AEater(v,Mary) A Eaten(v , BritishFood) 

This is clearly more than we want to say. The fact that the speaker believes 
this proposition does not make it true; it is only true in the world represented 
by the speaker’s beliefs. What is needed is a representation that has a struc¬ 
ture similar to this, but where the Eating event is given a special status. 

Note that reverting to the simpler predicate representations we used 
earlier in this chapter does not help. A common mistake using such rep¬ 
resentations would be to represent this sentence with the following kind of 
formula. 

Believing(Speaker,Eating(Mary , BritishFood)) 

The problem with this representation is that it is not even valid FOPC. The 
second argument to the Believing predicate should be a FOPC term, not a 
formula. This syntactic error reflects a deeper semantic problem. Predicates 
in fopc hold between the objects in the domain being modeled, not between 
the relations that hold among the objects in the domain. Therefore, fopc 
lacks a meaningful way to assert relations about full propositions, which is 
unfortunately exactly what words like believe , want , imagine and know want 
to do. 

The standard method for handling this situation is to augment fopc 
with operators that allow us to make statements about full logical formu¬ 
las. Let’s consider how this approach might work in the case of Example 
14.72. We can introduce an operator called Believes that takes two fopc 
formulas as its arguments: a formula designating a believer, and a formula 
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designating the believed proposition. Applying this operator would result in 
the following meaning representation. 

Believes(Speaker , 3vISA(v. Eating) 

AEater(v,Mary) A Eaten(v , BritishFood ) 

Under this approach, the contribution of the word believes to this mean¬ 
ing representation is not a fopc proposition at all, but rather an operator that 
is applied to the believed proposition. Therefore, as we discuss in Chap¬ 
ter 15, these world creating verbs play quite a different role in the semantic 
analysis than more ordinary verbs like eat. 

As one might expect, keeping track of who believes what about whom 
at any given point in time gets rather complex. As we will see in Chapter 18, 
this is an important task in interactive systems that must track users’ beliefs 
as they change during the course of a dialog. 

Operators like Believes that apply to logical formulas are known as 
modal operators. Correspondingly, a logic augmented with such operators 
is known as a modal logic. Modal logics have found many uses in the rep¬ 
resentation of commonsense knowledge in addition to the modeling of be¬ 
lief, among the more prominent are representations of time and hypothetical 
worlds. 

Not surprisingly, modal operators and modal logics raise a host of com¬ 
plex theoretical and practical problems that we can not even begin to do jus¬ 
tice to here. Among the more important issues are the following. 

• How inference works in the presence of specific modal operators. 

• The kinds of logical formula that particular operators can be applied 
to. 

• How modal operators interact with quantifiers and logical connectives. 

• The influence of these operators on the equality of terms across formu¬ 
las. 

The last issue in this list has consequences for modeling agent’s knowl¬ 
edge and beliefs in dialog systems and deserves some elaboration here. In 
standard fopc systems, logical terms that are known to be equal to one an¬ 
other can be freely substituted without having any effect on the truth of sen¬ 
tences they occur in. Consider the following examples 

(14.73) Snow has delayed Flight 1045. 

(14.74) John’s sister’s flight serves dinner. 

Assuming that these two flights are the same, substituting Flight 1045 for 
John's sister's flight has no effect on the truth of either sentence. 



Section 14.4. Some Linguistically Relevant Concepts 


533 


Now consider, the following variation on the first example. 

(14.75) John knows that snow has delayed Flight 1045. 

(14.76) John knows that his sister’s flight serves dinner. 

Here the substitution does not work. John may well know that Flight 1045 
has been delayed without knowing that his sister’s flight is delayed, simply 
because he may not know the number of his sister’s flight. In other words, 
even if we assume that these sentences are true, and that John’s sister is 
on Flight 1045, we can not say anything about the truth of the following 
sentence. 


(14.77) John knows that snow has delayed his sister’s flight. 


Settings like this where a modal operator like Know is involved are 
called referentially opaque. In referentially opaque settings, substitution of 
equal terms may or may not succeed. Ordinary settings where such substitu¬ 
tions always work are said to be referentially transparent. 


REFEREN¬ 
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ENT 


Pitfalls 


As noted in Section 14.3, there are a number of common mistakes in rep¬ 
resenting the meaning of natural language utterances, that arise from con¬ 
fusing, or equating, elements from real languages with elements in fopc. 
Consider the following example, which on the surface looks like a standard 
implication rule. 

(14.78) If you’re interested in baseball, the Rockies are playing tonight. 

A straightforward translation of this sentence into fopc might look some¬ 
thing like this. 

Havelnterest In(Hearer, Baseball) 

=> Playing (Rockies, Tonight) 

This representation is flawed for a large number of reasons. The most ob¬ 
vious ones arise from the semantics of fopc implications. In the event that 
the hearer is not interested in baseball, this formula becomes meaningless. 
Specifically, we can not draw any conclusion about the consequent clause 
when the antecedent is false. But of course this is a ridiculous conclusion, 
we know that the Rockies game will go forward regardless of whether or not 
the hearer happens to like baseball. Exercise 14.10 asks you to come up with 
a more reasonable fopc translation of this example. 

Now consider the following example. 

(14.79) One more beer and I’ll fall off this stool. 
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Again, a simple-minded translation of this sentence might consist of a con¬ 
junction of two clauses: one representing a drinking event and one represent¬ 
ing a falling event. In this case, the surface use of the word and obscures the 
fact that this sentence instead has an implication underlying it. The lesson 
of both of these examples is that English words like and , or and if are only 
tenuously related to the elements of fopc with the same names. 

Along the same lines, it is important to remember the complete lack 
of significance of the names we make use of in representing fopc formulas. 
Consider the following constant. 


InexpensiveVegetarianlndianFoodOnTuesdays 


Despite its impressive morphology, this term, by itself, has no more meaning 
than a constant like X99 would have. See McDermott (1976) for a discourse 
on the inherent dangers of such naming schemes. 


14.5 Related Representational Approaches 

Over the years, a fair number of representational schemes have been invented 
to capture the meaning of linguistic utterances for use in natural language 
processing systems. Other than logic, two of the most widely used schemes 
networks have been Semantic Networks and Frames, which are also known as slot- 

frames filler representations. The kl-one (Brachman and Schmolze, 1985a), and 

krl (Bobrow and Winograd, 1977) systems represent influential efforts to 
represent knowledge for use in natural language processing systems. 

In semantic networks, objects are represented as nodes in a graph, with 
relations between objects being represented by named links. In frame-based 
systems, objects are represented as feature-structures similar to those dis¬ 
cussed in Chapter 11, which can, of course, also be naturally represented as 
graphs. In this approach features are called slots and the values, or fillers, 
of these slots can either be atomic values or other embedded frames. The 
following diagram illustrates how Example 14.72 might be captured in a 
frame-based approach. 


I believe Mary ate British food. 
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It is now widely accepted that meanings represented in these approaches can 
be translated into equivalent statements in fopc with relative ease. 

14.6 Alternative Approaches to Meaning 


The notion that the translation of linguistic inputs into a formal representa¬ 
tion made up of discrete symbols adequately captures the notion of meaning 
is, not surprisingly, subject to a considerable amount of debate. The follow¬ 
ing sections give brief, wholly inadequate, overviews of some of the major 
concerns in these debates. 

Meaning as Action 

An approach that holds considerable appeal when we consider the seman¬ 
tics of imperative sentences is the notion of meaning as action. Under this 
view, utterances are viewed as actions, and the meanings of these utterances 
resides in procedures that are activated in the hearer as a result of hearing 
the utterance. This approach was followed in the creation of the histori¬ 
cally important SHRDLU system, and is summed up well by its creator Terry 
Winograd (1972b). 

One of the basic viewpoints underlying the model is that all lan¬ 
guage use can be thought of as a way of activating procedures 
within the hearer. We can think of an utterance as a program - 
one that indirectly causes a set of operations to be carried out 
within the hearer’s cognitive system. 

A recent procedural model of semantics is the executing schema or 
x-schema model of Bailey et al (1997), Narayanan (1997a, 1997b), and 
Chang et al (1998). The intuition of this model is that various parts of the 
semantics of events, including the aspectual factors discussed on 526, are 
based on schematized descriptions of sensory-motor processes like incep¬ 
tion, iteration, enabling, completion, force, and effort. The model represents 
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the aspectual semantics of events via a kind of probabilistic automaton called 
a Petri net (Murata, 1989). The nets used in the model have states like ready , 
process , finish , suspend , and result 

The meaning representation of an example like Jack is walking to the 
store activates the process state of the walking event. An accomplishment 
event like Jack walked to the store activates the result state. An iterative 
activity like Jack walked to the store every week is simulated in the model 
by an iterative activation of the process and result nodes. This idea of using 
sensory-motor primitives as a foundation for semantic description is also 
based on the work of Regier (1996) on the role of visual primitives in a 
computational model of learning the semantics of spatial prepositions. 

Meaning as Truth 

The role of formal meaning representations in linguistics, natural language 
processing, artificial intelligence, and cognitive modeling, is quite different 
from its role in more philosophical circles. In the former approaches, the 
name of the game is getting from linguistic inputs to appropriate, unambigu¬ 
ous, and operationally useful representations. 3 

To philosophers, however, the mere translation of a sentence from its 
original natural form to another artificial form does not get us any closer to its 
meaning (Lewis, 1972). Formal representations may facilitate real semantic 
work, but are not by themselves of much interest. Under this view, the im¬ 
portant work is in the functions, or procedures, that determine the mapping 
from these representations to the world being modeled. Of particular interest 
conditions i n these approaches are the functions that determine the truth conditions of 
sentences, or their formal representations. 


14.7 Summary 

This chapter has introduced the representational approach to meaning. The 
following are some of the highlights of this chapter. 

• A major approach to meaning in computational linguistics involves the 
creation of formal meaning representations that capture the meaning- 
related content of linguistic inputs. These representations are intended 
to bridge the gap from language to commonsense knowledge of the 


3 Of course, what counts as useful varies considerably among these areas 
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world. 

• The frameworks specify the syntax and semantics of these representa¬ 
tions are called meaning representation languages. A wide variety of 
such languages are used in natural language processing and artificial 
intelligence. 

• Such representations need to be able to support the practical compu¬ 
tational requirements of semantic processing. Among these are the 
need to determine the truth of propositions, to support unambiguous 
representation, to represent variables, to support inference, and to be 
expressive. 

• Human languages have a wide variety of features that are used to con¬ 
vey meaning. Among the most important of these is the ability to con¬ 
vey a predicate-argument structure. 

• FOPC is a well-understood computationally tractable meaning repre¬ 
sentation language that offers much of what is needed in a meaning 
representation language. 

• Important classes of meaning including categories, events, and time 
can be captured in FOPC. Propositions corresponding to such concepts 
as beliefs and desires require extensions to fopc including modal op¬ 
erators. 

• Semantic networks and frames can be captured within the FOPC frame¬ 
work. 


Bibliographical and Historical Notes 

The earliest computational use of declarative meaning representations in 
natural language processing was in the context of question-answering sys¬ 
tems (Green et al , 1963; Raphael, 1968; Lindsey, 1963). These systems 
employed ad-hoc representations for the facts needed to answer questions. 
Questions were then translated into a form that could be matched against 
facts in the knowledge base. Simmons (1965) provides an overview of these 
early efforts. 

Woods (1967) investigated the use of FOPC-like representations in question¬ 
answering as a replacement for the ad-hoc representations in use at the time. 
Woods (1973) further developed and extended these ideas in the landmark 
Lunar system. Interestingly, the representations used in Lunar had both a 
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truth-conditional and a procedural semantics. Winograd (1972b) employed 
a similar representation based on the Micro-Planner language in his SHRDLU 
system. 

During this same period, researchers interested in the cognitive model¬ 
ing of language and memory had been working with various forms of asso¬ 
ciative network representations. Masterman (1957) was probably the first to 
make computational use of a semantic network-like knowledge representa¬ 
tion, although semantic networks are generally credited to Quillian (1968). 

A considerable amount work in the semantic network framework was carried 
out during this era (Norman and Rumelhart, 1975; Schank, 1972; Wilks, 
1975c, 1975b; Kintsch, 1974). It was during this period that a number of 
researchers began to incorporate Fillmore’s notion of case roles (Fillmore, 
1968) into their representations. Simmons (1973a) was the earliest adopter 
of case roles as part of representations for natural language processing. 

Detailed analyses by Woods (1975) and Brachman and Schmolze (1985a) 
aimed at figuring out what semantic networks actually mean led to the devel¬ 
opment of a number of more sophisticated network-like languages including 
krl (Bobrow and Winograd, 1977) and kl-one (Brachman and Schmolze, 
1985a). As these frameworks became more sophisticated and well-defined 
it became clear that they were restricted variants of fopc coupled with spe¬ 
cialized inference procedures. A useful collection of papers covering much 
of this work can be found in (Brachman and Levesque, 1985). Russell and 
Norvig (1995) describe a modem perspective on these representational ef¬ 
forts. 

Linguistic efforts to assign semantic structures to natural language sen¬ 
tences in the generative era began with the work of Katz and Fodor (1963). 
The limitations of their simple feature-based representations and the natu¬ 
ral fit of logic to many of linguistic problems of the day quickly led to the 
adoption of a variety of predicate-argument structures as preferred semantic 
representations (Lakoff, 1972; McCawley, 1968). The subsequent introduc¬ 
tion by Montague (1973) of truth-conditional model-theoretic framework 
into linguistic theory led to a much tighter integration between theories of 
formal syntax and a wide range of formal semantic frameworks. Good in¬ 
troductions to Montague semantics and its role in linguistic theory can be 
found in (Dowty et a/., 1981; Partee, 1976). 

The representation of events as reified objects is due to Davidson (1967). 
The approach presented here, which explicitly reifies event participants, is 
due to Parsons (1990). The use of modal operators and modal logic in the 
representation of knowledge and belief is due to Hintikka (1969a). Moore 



Section 14.7. Summary 


539 


(1977) was the first to make computational use of this approach. Faucon- 
nier (1985) deals with a wide range of issues relating to beliefs and belief 
spaces from a cognitive science perspective. Most current computational 
approaches to temporal reasoning are based on Allen’s notion of temporal 
intervals (Allen, 1984). ter Meulen (1995) provides a modem treatment of 
tense and aspect. Davis (1990) describes the use of fopc to represent knowl¬ 
edge across a wide range of common sense domains including quantities, 
space, time, and beliefs. 

A recent comprehensive treatment of logic and language can be found 
in (van Benthem and ter Meulen, 1997). The classic semantics text is (Lyons, 
1977). McCawley (1993) is an indispensable textbook covering a wide range 
of topics concerning logic and language. Chierchia and McConnell-Ginet 
(1991) also provides broad coverage of semantic issues from a linguistic 
perspective. Heim and Kratzer (1998) is a more recent text written from the 
perspective of current generative theory. 


Exercises 

14.1 Choose a recipe from your favorite cookbook and try to make explicit 
all the common-sense knowledge that would be needed to follow it. 

14.2 Proponents of information retrieval occasionally claim that natural 
language texts in their raw form are a perfectly suitable source of knowledge 
for question answering. Sketch an argument against this claim. 

14.3 Peruse your daily newspaper for three examples of ambiguous sen¬ 
tences. Describe the various sources of the ambiguities. 

14.4 Consider a domain where the word coffee can refer to the follow¬ 
ing concepts in a knowledge-base: a caffeinated or decaffeinated beverage, 
ground coffee used to make either kind of beverage, and the beans them¬ 
selves. Give arguments as to which of the following uses of coffee are am¬ 
biguous and which are vague. 

a. I’ve had my coffee for today. 

b. Buy some coffee on your way home. 
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c. Please grind some more coffee. 

14.5 Encode in fopc as much of the knowledge as you can that you came 
up with for Exercise 14.1 

14.6 The following rule, which we gave as a translation for Example 14.20, 
is not a reasonable definition of what it means to be a vegetarian restaurant. 

' ixVegetarianRestaurant{x ) => Serves(x. VegetarianFood) 

Give a fopc rule that better defines vegetarian restaurants in terms of what 
they serve. 

14.7 Give a fopc translations for the following sentences: 

a. Vegetarians do not eat meat. 

b. Not all vegetarians eat eggs. 

14.8 Give a set of facts and inferences necessary to prove the following 
assertions: 

a. McDonalds is not a vegetarian restaurant. 

b. Some vegetarians can eat at McDonalds. 

Don’t just place these facts in your knowledge-base. Show that they 
can be inferred from some more general facts about vegetarians and Mc¬ 
Donalds 

14.9 Give fopc translations for the following sentences that capture the 
temporal relationships between the events. 

a. When Mary’s flight departed, I ate lunch. 

b. When Mary’s flight departed, I had eaten lunch. 

14.10 Give a reasonable fopc translation of the following example. 

If you’re interested in baseball, the Rockies are playing tonight. 

14.11 On Page 512 we gave the following fopc translation for Example 
14.17. 

Have(Speaker,FiveDollars ) A -i Have (Speaker, Lot O/Time) 

This literal representation would not be particularly useful to a restaurant- 
oriented question answering system. Give a deeper fopc meaning represen¬ 
tation for this example that is closer to what it really means. 
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14.12 Describe, in English, the knowledge that would be needed to infer 
the deeper representation you produced for the last exercise from the initial 
literal representation. 

14.13 On Page 512, we gave the following representation as a translation 
for the sentence Ay Caramba is near ICSL 

Near(LocationOf(AyCaramba ), LocationOf(ICSI )) 

In our truth-conditional semantics, this formula is either true or false given 
the contents of some knowledge-base. Critique this truth-conditional ap¬ 
proach with respect to the meaning of words like near. 
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‘Then you should say what you mean,’ the March Hare went on. 
T do,’ Alice hastily replied; ‘at least-at least I mean what I say- 
that’s the same thing, you know’ 

‘Not the same thing a bit!’ said the Hatter. ‘You might just as 
well say that ”1 see what I eat” is the same thing as ”1 eat what 
I see”!’ 

Lewis Carroll, Alice in Wonderland 


This chapter presents a number of computational approaches to the 
problem of semantic analysis, the process whereby meaning representations analysis 0 
of the kind discussed in the previous chapter are composed and assigned 
to linguistic inputs. As we will see in this and later chapters, the creation 
of rich and accurate meaning representations necessarily involves a wide 
range of knowledge-sources and inference techniques. Among the sources of 
knowledge that are typically used are the meanings of words, the meanings 
associated with grammatical structures, knowledge about the structure of the 
discourse, knowledge about the context in which the discourse is occurring, 
and common-sense knowledge about the topic at hand. 

The first approach we cover is a kind of syntax-driven semantic anal¬ 
ysis that is fairly limited in its scope. It assigns meaning representations to 
inputs based solely on static knowledge from the lexicon and the grammar. 

In this approach, when we refer to an input’s meaning, or meaning represen¬ 
tation, we have in mind an impoverished representation that is both context- 
independent and inference-free. Meaning representations of this type corre¬ 
spond to the notion of a literal meaning introduced in the last chapter. 

There are two reasons for proceeding along these lines: there are some 
limited application domains where such representations are sufficient to pro- 
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duce useful results, and these impoverished representations can serve as in¬ 
puts to subsequent processes that can produce richer, more useful, meaning 
representations. Chapters 18 and 19 will show how these meaning represen¬ 
tations can be used in processing extended discourses, while Chapter 21 will 
show how they can be used in machine translation. 

Section 15.5 then presents two alternative approaches to semantic anal¬ 
ysis that are more well-suited to practical applications. The first approach, 
semantic grammars, has been widely applied in the construction of inter¬ 
active dialog systems. In this approach, the elements of the grammars are 
strongly motivated by the semantic entities and relations of the domain be¬ 
ing discussed. As we will see, the actual algorithms used in this approach 
are quite similar to those described in Section 15.1. The difference lies in 
the grammars that are used. 

The final approach, presented in Section 15.5, addresses the task of 
extracting small amounts of pertinent information from large bodies of text. 
As we will see, this information extraction task does not require the kind 
of complete syntactic analysis assumed in the other approaches. Instead, 
a series of quite limited, mostly finite-state, automata are combined via a 
cascade to produce a robust semantic analyzer. 


15.1 Syntax-Driven Semantic Analysis 

The approach detailed in this section is based on the principle of composi- 
tionaljty° F tionality. 1 The key idea underlying this approach is that the meaning of a 
sentence can be composed from the meanings of it parts. Of course, when in¬ 
terpreted superficially, this principle is somewhat less than useful. We know 
that sentences are composed of words, and that words are the primary car¬ 
riers of meaning in language. It would seem then that all this principle tells 
us is that we should compose the meaning representation for sentences from 
the meanings of the words that make them up. 

Fortunately, the Mad Hatter has provided us with a hint as to how to 
make this principle useful. The meaning of a sentence is not based solely on 
the words that make it up, it is based on the ordering, grouping, and relations 
among the words in the sentence. Of course, this is simply another way 

1 This is normally referred to as Frege’s principle of compositionality. There appears to be 
little reason for this ascription, since the principle never explicitly appears in any of his writ¬ 
ings. Indeed, many of his writings can be taken as supporting a decidedly non-compositional 
view. Janssen (1997) discusses this topic in more detail. 
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of saying that the meaning of a sentence is partially based on its syntactic 
structure. Therefore, in syntax-driven semantic analysis , the composition of 
meaning representations is guided by the syntactic components and relations 
provided by the kind of grammars discussed in Chapters 9, 11, and 12. 

We can begin by assuming that the syntactic analysis of an input sen¬ 
tence will form the input to a semantic analyzer. Figure 15.1 illustrates the 
obvious pipeline-oriented approach that follows directly from this assump¬ 
tion. An input is first passed through a parser to derive its syntactic analysis. 
This analysis is then passed as input to a semantic analyzer to produce a 
meaning representation. Note that although this diagram shows a parse tree 
as input, other syntactic representations such as feature structures, or lexi¬ 
cal dependency diagrams, can be used. The remainder of this section will 
assume tree-like inputs. 

Before moving on, we should make explicit a major assumption about 
the role ambiguity of this approach. In the syntax driven approach presented 
here, ambiguities arising from the syntax and the lexicon will lead to the cre¬ 
ation of multiple ambiguous meaning representations. It is not the job of the 
semantic analyzer, narrowly defined, to resolve these ambiguities. Instead, 
it is the job of subsequent interpretation processes with access to domain 
specific knowledge, and knowledge of context to select among competing 
representations. Of course, we can cut down on the number of ambiguous 
representations produced, through the use of robust part-of-speech taggers, 
prepositional phrase attachment mechanisms, and, as we will see in Chap¬ 
ter 16, word-sense disambiguation mechanisms. 

Let’s consider how such an analysis might proceed with the following 
example. 

(15.1) AyCaramba serves meat. 

Figure 15.2 shows the simplified parse tree (lacking feature attachments), 
along with an appropriate meaning representation for this example. As sug¬ 
gested by the dashed arrows, a semantic analyzer given this tree as input 
might fruitfully proceed by first retrieving a meaning representation from the 
subtree corresponding to the verb serves. The analyzer might next retrieve 


SEMANTIC 

ANALYZER 



546 


Chapter 15. Semantic Analysis 


A Server (e^ AyCaramba) A Served(e, M^at) 

\ YP 

Proper-Noun Verb - J Mass-Noun 


AyCaramba serves meat 

Figure 15.2 Parse tree for the sentence AyCaramba serves meat 



meaning representations corresponding to the two noun phrases in the sen¬ 
tence. Then using the representation acquired from the verb as a template, 
the noun phrase meaning representations can be used to bind the appropriate 
variables in the verb representation, thus producing the meaning representa¬ 
tion for the sentence as a whole. 

Unfortunately, there is a rather obvious problem with this simplified 
story. As described, the function used to interpret the tree in Figure 15.2 
must know, among other things, that it is the verb that carries the template 
upon which the final representation is based, where this verb occurs in the 
tree, where its corresponding arguments are, and which argument fills which 
role in the verb’s meaning representation. In other words, it requires a good 
deal of specific knowledge about this particular example and its parse tree to 
create the required meaning representation. Given that there are an infinite 
number of such trees for any reasonable grammar, any approach based on 
one semantic function for every possible tree is in serious trouble. 

Fortunately, we have faced this problem before. Languages are not 
defined by enumerating the strings or trees that are permitted, but rather by 
specifying finite devices that are capable of generating the required set of 
outputs. It would seem, therefore, that the right place for semantic knowl¬ 
edge in a syntax-directed approach is with the finite set of devices that are 
used to generate trees in the first place: the grammar rules and the lexical 
rule™ entries. This is known as the rule to rule hypothesis(Bach, 1976). 

Designing an analyzer based on this approach brings us back to the no¬ 
tion of parts and what it means for them to have meanings. The remainder of 
this section can be seen as an attempt to answer the following two questions. 

• What does it mean for syntactic constituents to have meanings? 

• What do these meanings have to be like so that they can be composed 
into larger meanings? 



Section 15.1. Syntax-Driven Semantic Analysis 


547 


Semantic Augmentations to Context-Free Grammar Rules 

In keeping with the approach begun in Chapter 11, we will begin by aug¬ 
menting context-free grammar rules with semantic attachments. These at- tachments 1 ' 
tachments can be thought of as instructions that specify how to compute 
the meaning representation of a construction from the meanings of its con¬ 
stituent parts. Abstractly, our augmented rules have the following structure. 

A —> oti... On {f(aj.sem, —, oc* .sent)} 

The semantic attachment to the basic context-free rule is shown in the 
{...} to the right of the rule’s syntactic constituents. This notation states that 
the meaning representation assigned to the construction A, which we will 
denote as A. sent, can be computed by running the function / on some subset 
of the semantic attachments of A’s constituents. 

This characterization of our semantic attachments as a simple func¬ 
tion application is rather abstract. To make this notion more concrete, we 
will walk through the semantic attachments necessary to compute the mean¬ 
ing representation for a series of examples beginning with Example 15.1, 
shown earlier in Figure 15.2. We will begin with the more concrete entities 
in this example, as specified by the noun phrases, and work our way up to the 
more complex expressions representing the meaning of the entire sentence. 

The concrete entities in this example are represented by the fopc constants 
AyCaramba and Meat. Our first task is to associate these constants with the 
constituents of the tree that introduce them. The first step toward accom¬ 
plishing this is to pair them with the lexical rules representing the words that 
introduce them into the sentence. 

ProperNoun AyCaramba {AyCaramba} 

MassNoun —> meat {Meat} 

These two rules specify that the meanings associated with the subtrees gen¬ 
erated by these rules consist of the constants AyCaramba and Meat. 

Note, however, that as the arrows in Figure 15.2 indicate, the subtrees 
corresponding to these rules do not directly contribute these fopc constants 
to the final meaning representation. Rather, it is the NPs higher in the tree 
that contribute them to the final representation. In keeping with the principle 
of compositionality, we can deal with this indirect contribution by stipulating 
that the upper TVPs obtain their meaning representations from the meanings 
of their children. In these two cases, we will assume that the meaning repre¬ 
sentations of the children are simply copied upward to the parents. 

NP ProperNoun {ProperNoun.sem} 




548 


Chapter 15. Semantic Analysis 


NP —> MassNoun {MassNoun.sem} 

These rules state that the meaning representation of the noun phrases 
are the same as the meaning representations of their individual components, 
denoted by ProperNoun.sem and MassNoun.sem. In general, it will be the 
case that for non-branching grammar rules, the semantic expression associ¬ 
ated with the child will be copied unchanged to the parent. 

Before proceeding, we should point out that there is at least one poten¬ 
tially confusing aspect to this discussion. While the static semantic attach¬ 
ment to our first NP rule is simply ProperNoun.sem , the semantic value of 
the tree produced by that rule in this example is AyCaramba. It is critical 
to distinguish between the semantic attachment of a rule, and the semantic 
value associated with a tree generated by a rule. The first is a set of in¬ 
structions on how to construct a meaning representation, while the second 
consists of the result of following those instructions. 

Returning to our example, having accounted for the constants in the 
representation, we can move on to the event underlying this utterance as 
specified by serves. As illustrated in Figure 15.2, a generic Serving event 
involves a Server and something Served , as captured in the following logical 
formula. 

3e,x,y Isa(e, Serving ) A Server (e^x) A Served(e,y) 

As a first attempt at this verb’s semantic attachment, we can simply 
take this logical formula as serve's semantic attachment, as in the following. 

Verb serves 

{3 e,x,y Isa(e, Serving) A Server (e,x) AServed(e : y)} 

Moving up the parse tree, the next constituent to be considered is the 
VP that dominates both serves and meat. Unlike the NPs, we can not simply 
copy the meaning of these children up to the parent VP. Rather, we need to 
incorporate the meaning of the NP into the meaning of the Verb and assign 
the resulting representation to the VP.sem. In this case, this consists of re¬ 
placing the variable y with the logical term Meat as the second argument of 
the Served role of the Serves event. This yields the following meaning rep¬ 
resentation, which can be glossed as something like someone serves meat. 

3 e,xIsa(e,Serving) AServer(e,x) AServed(e,Meat) 

To come up with this representation, the semantic attachment for the 
VP must provide a means to replace the quantified variable y within the body 
of V.sem with the logical constant Meat , as stipulated by NP.sem. Abstracting 
away from this specific example, the VP semantic attachment must have two 
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capabilities: the means to know exactly which variables within the Verb's 
semantic attachment are to be replaced by the semantics of the Verb's argu¬ 
ments, and the ability to perform such a replacement. 

Unfortunately, there is no straightforward way to do this given the 
mechanisms we now have at our disposal. The fopc formula we attached to 
the V.sem does not provide any advice about when and how each of its three 
quantified variables should be replaced, and we have no simple way, within 
our current specification of fopc, for performing such a replacement even if 
we did know. 

Fortunately, there is a notational extension to fopc called the lambda 
notation(Church, 1940) that provides exactly the kind of formal parameter Nation 
functionality that we need. This notation extends the syntax of fopc to 
include expressions of the following form. 

A xP{x) 

Such expressions consist of the Greek symbol A, followed by one or more 
variables, followed by a fopc expression that makes use of those variables. 

The usefulness of these A-expressions is based on the ability to apply 
them to logical terms to yield new fopc expressions where the formal pa¬ 
rameter variables are bound to the specified terms. This process is known 
as A-reduction and is little more than a simple textual replacement of the 
A variables with the specified fopc terms, accompanied by the subsequent 
removal of the A. The following expressions illustrate the application of a 
A-expression to the constant A, followed by the result of performing a A- 
reduction on this expression. 

A xP{x) (A) 

P(A) 

This A-notation provides both of the capabilities we said were needed in the 
Verb semantics: the formal parameter list makes a set of variables within the 
body available, and the A-reduction process implements the desired replace¬ 
ment of variables with terms. 

An important and useful variation of this technique is the use of one 
A-expression as the body of another as in the following expression. 

facXy Near{x,y) 

This fairly abstract expression can be glossed as the state of some¬ 
thing being near something else. The following expressions illustrate a sin¬ 
gle A-application and subsequent reduction with this kind of embedded A- 
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expression. 

XxXy Near(x , y) ( ICSI ) 

Xy Near(ICSI,y) 

The important point here is that the resulting expression is still a X- 
expression; the first reduction bound the variable x and removed the outer 
X, thus revealing the inner expression. As might be expected, this result¬ 
ing ^-expression can, in turn, be applied to another term to arrive at a fully 
specified logical formula, as in the following. 

Xy Near(ICSI , y) (AyCaramba) 

Near (ICSI, AyCaramba ) 

currying This technique, called currying 2 (Schonkfinkel, 1924), is a way of 

converting a predicate with multiple arguments into a sequence of single 
argument predicates. As we will see shortly, this technique is quite useful 
when the arguments to a predicate do not all appear together as daughters of 
the predicated in a parse tree. 

With the ^-notation and the process of ^-reduction, we have the tools 
needed to return to the semantic attachments for our VP constituent. Re¬ 
call that what was needed was a way to replace the variable representing the 
Served role with the meaning representation provided by the NP constituent 
of the VP. This can be accomplished in two steps: changing the semantic 
attachment of the Verb to a ^-expression, and having the semantic attach¬ 
ment of the VP apply this expression to the NP semantics. The first of these 
steps can be accomplished by designating x, the variable corresponding to 
the Served role, as the ^-variable for a ^-expression provided as the seman¬ 
tic attachment for serve. 

Verb —y serves 

{Xx3e,y Isa(e,Serving) AServer(e,y) AServed(e,x)} 

This attachment makes the variable x externally available to be bound 
by an application of this expression to a logical term. The attachment for our 
transitive VP rule, therefore, specifies a ^-application where the X-expression 
is provided by Verb.sem and the argument is provided by NP.sem. 

VP —> VerbNP {Verb.sem(NP.sem)} 

This ^-application results in the replacement, or binding, of x, the 
single formal parameter of the X-expression, with the value contained in 

2 Currying is the standard term, although Heim and Kratzer (1998) present an interest¬ 
ing argument for the term Schonkfinkelization over currying, since Curry later built on 
Schonfinkel’s work. 
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NP.sem. A X-reduction removes the X revealing the inner expression with 
the parameter x replaced by the constant Meat. This expression, the mean¬ 
ing of the verb phrase serves meat , is then the value of VP.sem. 

3 e,y Isa(e , Serving) A Server (e,y) A Served(e,Meat) 

To complete this example, we must create the semantic attachment for 
the S rule. Like the VP rule, this rule must incorporate an NP argument into 
the appropriate role in the event representation now residing in the VP.sem. It 
should, therefore, consist of another ^-application where the value of VP.sem 
provides the ^-expression and the sentence-initial NP.sem provides the final 
argument to be incorporated. 

S NP VP {VP.sem(NP.sem)} 

Unfortunately, as it now stands the value of VP.Sem doesn’t provide the 
necessary X expression. The lambda -application performed at the VP rule 
resulted in a generic FOPC expression with two existentially quantified vari¬ 
ables. The Verb attachment should instead have consisted of an embedded 
X-expression to make the Server role available for binding at the S level of 
the grammar. Therefore, our revised representation of the Verb attachment 
will be the following. 

Verb —> serves 

{XxXy Be Isa(e , Serving ) A Server(e,y) A Served(e,x)} 

The body of this Verb attachment consists of a ^-expression inside a 
A-expression. The outer expression provides the variable that is replaced by 
the first A-reduction, while the inner expression can be used to bind the final 
variable corresponding to the Server role. This ordering of the variables in 
the multiple layers ^-expressions in semantic attachment of the verb explic¬ 
itly encodes facts about the expected location of a Verb's arguments in the 
syntax. 

The parse tree for this example, with each node annotated with its cor¬ 
responding semantic value, is shown in Figure 15.3. 

This example has served to illustrate several of the most basic tech¬ 
niques used in this syntax-driven approach to semantic analysis. Section 
15.2 will provide a more complete inventory of semantic attachments for 
some of the major English grammatical categories. Before proceeding to 
that inventory, however, we will first analyze several additional examples. 
These examples will serve to introduce a few more of the basic constructs 
needed to make this approach work, and will illustrate the general approach 
to developing semantic attachments for a grammar. 
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S BeIsa(e } Serving) A Server(e,AC) !\Served{e,Meat) 



NP AC VP fac3elsa(e , Serving) A Serverie , x) A Served(e , Meat) 



NP Mear 

ProperNoun AC Verb Mass-Noun 




AyCaramba serves meat 

Figure 15.3 Parse tree with semantic attachments for AyCaramba serves 
meat. 


Let’s consider the following variation on Example 15.1. 

(15.2) A restaurant serves meat. 

Since the verb phrase of this example is unchanged from Example 15.1, we 
can restrict our attention to the derivation of the semantics of the subject 
noun phrase and its subsequent integration with the verb phrase in the S rule. 
As a starting point, let’s assume that the following formula is a plausible 
representation for the meaning of the subject in this example. 

Bxlsa(x, Restaurant) 

Combining this new representation with the one already developed for the 
verb phrase, yields the following meaning representation. 

Be, x Isa(e , Serving) 

A Server(e,x) A Served(e,Meat) A Isa(x, Restaurant) 

In this formula, the restaurant, represented by the variable x, is specified as 
playing the role of the Server by its presence as the second argument to the 
Server predicate. 

Unfortunately, the ^-application specified as the semantic attachment 
for the S rule will not produce this result. A literal interpretation of X- 
reduction as a textual replacement results in the following expression, where 
the entire meaning representation of the noun phrase is embedded in the 
Server predicate. 

Be Isa(e , Serving) 

A Server (e, Bxlsa{x,Restaurant)) A Served(e,Meat) 

Although this expression has a certain intuitive appeal, it is not a valid 
fopc formula. Expressions like the one denoting our restaurant can not 
appear as arguments to predicates; such arguments are limited to fopc terms. 
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In fact, since by definition ^-expressions can only be applied to fopc terms, 
the application of the ^-expression attached to the VP to the semantics of the 
subject was ill-formed to begin with. 

We can solve this problem in a manner similar to the way that X- 
expressions were used to solve the verb phrase and S semantic attachment 
problems: by adding a new notation to the existing fopc syntax that facil¬ 
itates the compositional creation of the desired meaning representation. In 
this case, we will introduce the notion of a complex-term that allows fopc term plex 
expressions like 3xlsa(x, Restaurant) to appear in places where normally 
only ordinary fopc terms would appear. Formally, a complex-term is an 
expression with the following three-part structure. 

< Quantifier variable body > 

Applying this notation to our current example, we arrive at the follow¬ 
ing representation. 

Be Isa(e , Serving ) 

AServer(e,< 3xlsa(x : Restaurant) >) A Served(e,Meat) 

As was the case with X-expressions, this notational change will only 
be useful if we can provide a straightforward way to convert it into ordinary 
fopc syntax. This can be accomplished by rewriting any predicate using a 
complex-term according to the following schema. 

P(< Quantifier variable body >) 

Quanti fier variable body Connective P(variable) 

In other words, the complex-term: 

1. Is extracted from the predicate in which it appears, 

2. Is replaced by the variable that represents the object in question, 

3. And has its variable, quantifier, body prepended to the new expression 

through the use of an appropriate connective. 

The following pair of expressions illustrates this complex-term reduc¬ 
tion on our current example. 

Server(e, < Bxlsa(x, Restaurant) >) 

3xlsa(x, Restaurant) A Server(e,x) 

The connective that is used to attach the extracted formula to the front of the 
new expression depends on the type of the quantifier being used: A is used 
with 3, and => is used with V. 
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It will also be useful to be able to access the three components of 
complex-terms. We will, therefore, extend the syntax used to refer to the 
semantics of a constituent by allowing reference to its parts. For exam¬ 
ple, if A.sem is a complex-term then A.sem.quantifier, A.sent.variable, and 
A.sem.body retrieve the complex-term’s quantifier, variable, and body, re¬ 
spectively. 

Returning to Example 15.2, we can now address the creation of the 
target meaning representation for the phrase a restaurant. Given the simple 
syntactic structure of this noun phrase, the job of the NP semantic attachment 
is fairly straightforward. 

NP —> DetNominal {< Det.semxNominal.sem(x) >} 

This attachment creates a complex-term consisting of a quantifier retrieved 
from the Det , followed by an arbitrary variable, and then an application of the 
^-expression associated with the Nominal to that variable. This ^-application 
ensures that the correct variable appears within the predicate specified by the 
Nominal. 

The attachment for the determiner simply specifies the quantifier to be 

used. 

Det —y a {3} 

The job of the nominal category is to create the Isa formula and X- 
expression needed for use in the noun phrase. 

Nominal —> Noun {XxIsa(x,Noun.sem)} 

Finally, the noun attachment simply provides the name of the category 
being discussed. 

Noun restaurant {Restaurant} 

In walking through this example, we have introduced five concrete 
mechanisms that instantiate the abstract functional characterization of se¬ 
mantic attachments that began this section. 

• The association of normal fopc expressions with lexical items. 

• The association of function-like ^-expressions with lexical items. 

• The copying of semantic values from children to parents. 

• The function-like application of ^-expressions to the semantics of one 

or more children of a constituent. 

• The use of complex-terms to allow quantified expressions to be tem¬ 
porarily treated as terms. 
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The introduction of ^-expressions and complex-terms was motivated 
by the gap between the syntax of fopc and the syntax of English. These 
extra-logical devices serve to bring the syntax of fopc closer to the syn¬ 
tax of the language being processed thus facilitating the semantic analysis 
process. Meaning representations that make use of these kinds of devices 
are usually referred to as quasi-logical forms or intermediate representa- logical 
tions. Note, there is a subtle difference in usage between these two uses. The INTERMEDI 
term quasi-logical form is usually applied to representations that can easily representa- 
be converted to a logical representation via some simple syntactic transfer- TI0NS 
mation. The term intermediate representation is normally used to refer to 
meaning representations that serve as input to further analysis processes in 
an attempt to produce deeper meaning representations. 

For the purposes of this chapter, our meaning representations are quasi- 
logical forms since they can easily be converted to fopc. From a somewhat 
broader perspective, they are also intermediate forms since further interpre¬ 
tation is certainly needed to get them closer to reasonable meaning represen¬ 
tations. 

The few rules introduced in this section also serve to illustrate a prin¬ 
ciple that guides the design of semantic attachments in the compositional 
framework. In general, it is the lexical rules that provide content level pred¬ 
icates and terms for our meaning representations. The semantic attachments 
to grammar rules put these predicates and terms together in the right ways, 
but do not in general introduce predicates and terms into the representation 
being created. 

Quantifier Scoping and the Translation of Complex Terms 

The schema given above to translate expressions containing complex terms 
into fopc expressions is, unfortunately, not unique. Consider the following 
example, along with its original unscoped meaning representation. 

(15.3) Every restaurant has a menu. 

3elsa(e , Having ) 

A Haver{e, < Vx Isa(x,Restaurant) >) 

AHad(e,< 3y Isa{y,Menu) >) 

If the complex-terms filling the Haver and the Had roles are rewritten 
so that the quantifier for the Haver role has the outer scope, then the result 
is the following meaning representation, which corresponds to the common- 
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sense interpretation of this sentence. 

\/xRestaurant(x) => 

3e,y Having(e) AHaver(e,x) AIsa(y,Menu) AHad(e,y) 

On the other hand, if the terms are rewritten in the reverse order, then 
the following fopc representation results, which states that there is one 
menu that all restaurants share. 

3y Isa(y , Menu) A Vx Isa(x.Restaurant) => 

3eHaving(e) AHaver(e,x) AHad(e,y) 

This example illustrates the problem of ambiguous quantifier scoping 
- a single logical formula with two complex terms gives rise to two distinct 
and incompatible fopc representations. In the worst case, sentences with N 
quantifiers will have 0(N\) different possible quantifier scopings. 

In practice, most systems employ an ad hoc set of heuristic preference 
rules that can be used to generate preferred forms in order of their overall 
likelihood. In cases where no preference rules apply, a left to right quantifier 
ordering that mirrors the surface order of the quantifiers is used. Domain 
specific knowledge can then be used to either accept a quantified formula, or 
reject it and request another formula. Alshawi (1992) presents a comprehen¬ 
sive approach to generating plausible quantifier scopings. 


15.2 Attachments for a Fragment of English 


This section describes a set of semantic attachments for a small fragment 
of English. As in the rest of this chapter, to keep the presentation simple, 
we omit the feature structures associated with these rules when they are not 
needed. Remember that these features are needed to ensure that the cor¬ 
rect rules are applied in the correct situations. Most importantly for this 
discussion, they are needed to ensure that the correct verb entries are being 
employed based on their subcategorization feature structures. 

Sentences 

For the most part, our semantic discussions have only dealt with declarative 
sentences. This section expands our coverage to include the other sentence 
types first introduced in Chapter 9: imperatives, Yes/No questions, and WH 
questions. Let’s start by considering the following examples. 

(15.4) Flight 487 serves lunch. 



Section 15.2. Attachments for a Fragment of English 


557 


(15.5) Serve lunch. 

(15.6) Does Flight 207 serve lunch? 

(15.7) Which flights serve lunch? 

The meaning representations of these examples all contain proposi¬ 
tions concerning the serving of lunch on flights. However, they differ with 
respect to the role that these propositions are intended to serve in the settings 
in which they are uttered. More specifically, the first example is intended to 
convey factual information to a hearer, the second is a request for an action, 
and the last two are requests for information. To capture these differences, 
we will introduce a set of operators that can be applied to fopc sentences 
in the same way that belief operators were used in Chapter 14. Specifically, 
the operators DCL , IMP, YNQ , and WHQ will be applied to the fopc repre¬ 
sentations of declaratives, imperatives, yes-no questions, and wh-questions, 
respectively. 

Producing meaning representations that make appropriate use of these 
operators requires the right set of semantic attachments for each of the pos¬ 
sible sentence types. For declarative sentences, we can simply alter the basic 
sentence rule we have been using as follows. 

S NP VP {DCL(VP.sem(NP.sem ))} 

The normal interpretation for a representation headed by the DCL operator 
would be as a factual statement to be added to the current knowledge-base. 

Imperative sentences begin with a verb phrase and lack an overt sub¬ 
ject. Because of the missing subject, the meaning representation for the main 
verb phrase will consist of a ^-expression with an unbound X -variable rep¬ 
resenting this missing subject. To deal with this, we can simply supply a 
subject to the ^-expression by applying a final ^-reduction to a dummy con¬ 
stant. The IMP operator can then be applied to this representation as in the 
following semantic attachment. 

S VP {IMP (VP sem (Dummy You ))} 

Applying this rule to Example 15.5, results in the following represen¬ 
tation. 

IMP(3eServing(e) A Server(e,DummyYou) A Served(e, Lunch) 

As will be discussed in Chapter 19, imperatives can be viewed as a kind of 
speech act - actions that are performed by virtue of being uttered. 

As discussed in Chapter 9, yes-no-questions consist of a sentence- 
initial auxiliary verb, followed by a subject noun phrase and then a verb 
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phrase. The following semantic attachment simply ignores the auxiliary, and 
with the exception of the YNQ operator, constructs the same representation 
that would be created for the corresponding declarative sentence. 

S -> AuxNP VP { YNQ (VP sem (NP. sem ))} 

The use of this rule with for Example 15.6 produces the following rep¬ 
resentation. 

YNQ(3eServing(e) A Server(e,Flt201) A Served(e, Lunch)) 

Yes-no-questions should be thought as asking the whether the proposi¬ 
tional part of its meaning is true or false given the knowledge currently con¬ 
tained in the knowledge-base. Adopting the kind of semantics described in 
Chapter 14, yes-no-questions can be answered by determining if the proposi¬ 
tion is in the knowledge-base, or if can be inferred from the knowledge-base. 

Unlike yes-no-questions, wh-subject-questions ask for specific infor¬ 
mation about the subject of the sentence rather than the sentence as a whole. 
The following attachment produces a representation that consists of the op¬ 
erator WHQ, the variable corresponding to the subject of the sentence, and 
the body of the proposition. 

S WhWord NP VP { WHQ (NP sem, var, VP sem {NP sem ))} 

The following representation is the result of applying this rule to Ex¬ 
ample 15.7. 

WH g(x, Be , x Isa(e, Serving) A Server(e , x) 

A Served(e,Lunch) A Isa(x,Flight)) 

Such questions can be answered by returning a set of assignments for the 
subject variable that make the resulting proposition true with respect to the 
current knowledge-base. 

Finally, consider the following wh-non-subject-question. 

(15.8) How can I go from Minneapolis to Long Beach? 

In examples like this, the question is not about the subject of the sentence but 
rather some other argument, or some aspect of the proposition as a whole. 
In this case, the representation needs to provide an indication as to what the 
question is about. The following attachment provides this information by 
providing the semantics of the auxiliary as an argument to the WHQ operator. 

S -> WhWord AuxNP VP {WHQ WhWord, sem VPsem(NP.sem)} 



Section 15.2. Attachments for a Fragment of English 


559 


The following representation would result from an application of this 
rule to Example 15.8. 

WHQ{How , Be Isa(e , Going ) A Goer{e , User) 

AOrigin(e,Minn) A Destination (e, LongBeach )) 

As we will discuss in Section 15.5 and Chapter 19, correctly answering this 
kind of question involves a fair amount of domain specific reasoning. For ex¬ 
ample, the correct way to answer Example 15.8 is to search for flights with 
the specified departure and arrival cities. Note, however, that there is no men¬ 
tion of flights or flying in the actual question. The question-answerer there¬ 
fore has to apply knowledge specific to this domain to the effect that ques¬ 
tions about going places are really questions about flights to those places. 

Finally, we should make it clear that this particular attachment is only 
useful for rather simple wh-questions without missing arguments or embed¬ 
ded clauses. As discussed in Chapter 11, the presence of long-distance 
dependencies in these questions requires additional mechanisms to deter¬ 
mine exactly what is being asked about. Woods (1977) and Alshawi (1992) 
provide extensive discussions of general mechanisms for handling wh-non- 
subject questions. Section 15.5 presents a more ad hoc approach that is often 
used in practical systems. 

Noun Phrases 

As we have already seen, the meaning representations for noun phrases can 
be either normal fopc terms or complex-terms. The following sections de¬ 
tail the semantic attachments needed to produce meaning representations for 
some of the most frequent kinds of English noun phrases. Unfortunately, as 
we will see, the syntax of English noun phrases provides surprisingly little 
insight into their meaning. It is often the case that the best we can do is 
provide a rather vague intermediate level of meaning representation that can 
serve as input to further interpretation processes. 

Compound Nominals 

Compound nominals, also known as noun-noun sequences, consist of simple 
sequences of nouns, as in the following examples. 

(15.9) Flight schedule 

(15.10) Summer flight schedule 

As noted in Chapter 9, the syntactic structure of this construction can be 
captured by the regular expression Noun*, or by the following context-free 
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grammar rules. 

Nominal Noun 

Nominal —> Noun Nominal 

In these constructions, the final noun in the sequence is the head of the 
phrase and denotes an object that is semantically related in some unspeci¬ 
fied way to the other nouns that precede it in the sequence. In general, an 
extremely wide range of common-sense relations can be denoted by this con¬ 
struction. Discerning the exact nature of these relationships is well beyond 
the scope of the kind of superficial semantic analysis presented in this chap¬ 
ter. The attachment in the following rule builds up a vague representation 
that simply notes the existence of a semantic relation between the head noun 
and the modifying nouns, by incrementally noting such a relation between 
the head noun and each noun to its left. 

Nominal Noun Nominal 

{Ax Nominal. sem(x) ANN(Noun.sem, x)} 

The relation NN is used to specify that a relation holds between the 
modifying elements of a compound nominal and the head Noun. In the ex¬ 
amples given above, this leads to the following meaning representations. 

hclsa(x : Schedule ) A NN(x, Flight) 

hclsa(x , Schedule) A AW(x, Flight) A AW(x, Summer) 

Note that this representation correctly instantiates a term representing 
a Schedule , while avoiding the creation of terms representing either a Flight 
or Summer. 

Genitive Noun Phrases 

Recall from Chapter 9 that genitive noun phrases make use of complex deter¬ 
miners that consist of noun phrases with possessive markers, as in Atlanta's 
airport and MaharanVs menu. It is quite tempting to represent the relation 
between these words as an abstract kind of possession. A little introspec¬ 
tion, however, reveals that the relation between a city and its airport has little 
in common with a restaurant and its menu. Therefore, as with compound 
nominals, it turns out to be best to simply state an abstract semantic relation 
between the various constituents. 

NP —> ComplexDet Nominal 

{< 3xNominal.sem(x) A GA(x, ComplexDet.sem) >} 
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ComplexDet —> NP ’s {NP. sem} 

Applying these rules to Atlanta’s airport results in the following com¬ 
plex term. 

< 3xlsa(x,Airport) A GN(x, Atlanta) > 

Subsequent semantic interpretation would have to determine that the relation 
denoted by the relation GN is actually a location. 

Adjective Phrases 

English adjectives can be split into two major categories: pre-nominal and 
predicate. These categories are exemplified by the following BERP exam¬ 
ples. 

(15.11) I don’t mind a cheap restaurant. 

(15.12) This restaurant is cheap. 

For the pre-nominal case, an obvious and often incorrect proposal for 
the semantic attachment is illustrated in the following rules. 

Nominal —> Adj Nominal 

{Ax Nominal.sem(x) A lsa{x, Adj.sem)} 

Adj y cheap {Cheap} 

This solution modifies the semantics of the nominal by applying the predi¬ 
cate provided by the adjective to the variable representing the nominal. For 
our cheap restaurant example, this yields the following fairly reasonable rep¬ 
resentation. 

Ax Isa(x. Restaurant) A Isa{x, Cheap) 

This is an example of what is known as intersective semantics since 
the meaning of the phrase can be thought of as the intersection of the cate¬ 
gory stipulated by the nominal and the category stipulated by the adjective. 
In this case, this amounts to the intersection of the category of cheap things 
with the category of restaurants. 

Unfortunately, this solution often does the wrong thing. For example, 
consider the following meaning representations for the phrases small ele¬ 
phant, former friend , and fake gun. 

Ax Isa(x, Elephant) A Isa{x, Small) 

Ax Isa(x, Friend) A Isa(x. Former) 

Ax Isa(x, Gun) A 1sa(x. Fake) 


INTERSEC¬ 

TIVE 

SEMANTICS 
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Each of these representations is peculiar in some way. The first one states 
that this particular elephant is a member of the general category of small 
things, which is probably not true. The second example is strange in two 
ways: it asserts that the person in question is a friend, which is false, and it 
makes use of a fairly unreasonable category of former things. Similarly, the 
third example asserts that the object in question is a gun despite the fact that 
fake means it is not one. 

As with compound nominals, there is no clever solution to these prob¬ 
lems within the bounds of our current compositional framework. Therefore, 
the best approach is to simply note the status of a specific kind of modifi¬ 
cation relation and assume that some further procedure with access to addi¬ 
tional relevant knowledge can replace this vague relation with an appropriate 
representation (Alshawi, 1992). 

Nominal Adj Nominal 

{XxNominal.sem(x) A AM(x,Adj.sem)} 

Applying this rule to a cheap restaurant results in the following formula. 

3x Isa(x, Restaurant) A AM(x, Cheap ) 

Note that even this watered-down proposal produces representations 
that are logically incorrect for the fake and former examples. In both cases, 
it asserts that the objects in question are in fact members of their stated cate¬ 
gories. In general, the solution to this problem has to be based on the specific 
semantics of the adjectives and nouns in question. For example, the seman¬ 
tics of former has to involve some form of temporal reasoning, while fake 
requires the ability to reason about the nature of concepts and categories. 

Verb Phrases 

The general schema for computing the semantics of verb phrases relies on 
the notion of function application. In most cases, the A-expression attached 
to the verb is simply applied to the semantic attachments of the verb’s ar¬ 
guments. There are, however, a number of situations that force us to depart 
somewhat from this general pattern. 

Infinitive Verb Phrases 

A fair number of English verbs take some form of verb phrase as one of their 
arguments. This complicates the normal verb phrase semantic schema since 
these argument verb phrases interact with the other other arguments of the 
head verb in ways that are not completely obvious. 
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Consider the following example. 

(15.13) I told Harry to go to Maharani. 

The meaning representation for this example should be something like the 
the following. 

Be, /, x Isa(e, Telling ) A Isa(f, Going ) 

ATeller(e, Speaker ) A Tellee(e , Harry ) A ToldT hing(e , /) 

A Goer(f, Harry ) A Destination(f,x) 

There are two interesting things to note about this meaning representa¬ 
tion: the first is that it consists of two events, and the second is that one of the 
participants, Harry , plays a role in both of the two events. The difficulty in 
creating this complex representation falls to the verb phrase dominating the 
verb tell which will something like the following as its semantic attachment. 

hc,y Xz Be Isa(e , Telling ) 

KTeller{e , z) A Tellee(e , x) A ToldThing(e , y) 

Semantically, we can interpret this subcategorization frame for Tell as pro¬ 
viding three semantic roles: a person doing the telling, a recipient of the 
telling, and the proposition being conveyed. 

The difficult part of this example involves getting the meaning repre¬ 
sentation for the main verb phrase correct. As shown in Figure 15.2, Harry 
plays the role of both the Tellee of the Telling event and the Goer of the 
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Going event. However, Harry is not available when the Going event is cre¬ 
ated within the infinitive verb phrase. 

Although there are several possible solutions to this problem, it is usu¬ 
ally best to stick with a uniform approach to these problems. Therefore, we 
will start by simply applying the semantics of the verb to the semantics of 
the other arguments of the verb as follows. 

VP —> VerbNPVPto {Verb.sem(NPsent, VPto.sem)} 

Since the to in the infinitive verb phrase construction does not con¬ 
tribute to its meaning, we simply copy the meaning of the child verb phrase 
up to the infinitive verb phrase. Recall, that we are relying on the unseen 
feature structures to ensure that only the correct verb phrases can with this 
construction. 

VP to —> to VP {VP.sem} 

In this solution, the verb’s semantic attachment has two tasks: incorpo¬ 
rating the NP.sem , the Goer , into the VPto.sem , and incorporating the Going 
event as the ToldThing of the Telling. The following attachment performs 
both tasks. 

Verb —> tell 
{Xx,y 
Xz 

3e,y.variable Isa(e , Telling) 

ATeller(e,z) ATellee(e,x) 

AToldThing(e,y.variable) Ay(x) 

In this approach, the X-variable x plays the role of the Tellee of the telling 
and the argument to the semantics of the infinitive, which is now contained as 
a ^-expression in the variable y. The expression y(x) represents a ^-reduction 
that inserts Harry into the Going event as the Goer. The notation y.variable , 
is analogous to the notation used for complex-term variables, and gives us 
access to the event variable representing the Going event within the infini¬ 
tive’s meaning representation. 

Note that this approach plays fast and loose with the definition of X- 
reduction, in that it allows ^-expressions to be passed as arguments to other 
^-expressions, when technically only fopc terms can serve that role. This 
technique is a convenience similar to the use of complex terms in that it al¬ 
lows us to temporarily treat complex expressions as terms during the creation 
of meaning representations. 
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Prepositional Phrases 

At a fairly abstract level, prepositional phrases serve two distinct functions: 
they assert binary relations between their heads and the constituents to which 
they are attached, and they signal arguments to constituents that have an ar¬ 
gument structure. These two functions argue for two distinct types of prepo¬ 
sitional phrases that differ based on their semantic attachments. We will 
consider three places in the grammar where prepositional phrases serve these 
roles: modifiers of noun phrases, modifiers of verb phrases, and arguments 
to verb phrases. 

Nominal Modifier Prepositional Phrases 

Modifier prepositional phrases denote a binary relation between the concept 
being modified, which is external to the prepositional phrase, and the head of 
the prepositional phrase. Consider the following example and its associated 
meaning representation. 

(1) A restaurant on Pearl 

3x Isa(x, Restaurant) A On(x, Pearl) 

The relevant grammar rules that govern this example are the following. 

NP —> Det Nominal 
Nominal —> Nominal PP 
PP P NP 

Proceeding in a bottom-up fashion, the semantic attachment for this 
kind of relational preposition should provide a two-place predicate with its 
arguments distributed over two ^-expressions, as in the following. 

P —► on {XyXx On(x,y)} 

With this kind of arrangement, the first argument to the predicate is provided 
by the head of prepositional phrase and the second is provided by the con¬ 
stituent that the prepositional phrase is ultimately attached to. The following 
semantic attachment provides the first part. 

PP PNP {Psem(NPsem)} 

This X-application results in a new ^-expression where the remaining argu¬ 
ment is the inner X- variable. 

This remaining argument can be incorporated using the following nom¬ 
inal construction. 

Nominal —> Nominal PP {XzNominalsem(z) APP.sem(z)} 
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Verb Phrase Modifier Prepositional Phrases 

The general approach to modifying verb phrases is similar to that of modi¬ 
fying nominals. The differences lie in the details of the modification in the 
verb phrase rule; the attachments for the preposition and prepositional phrase 
rules are unchanged. Let’s consider the phrase ate dinner in a hurry which 
is governed by the following verb phrase rule. 

VP VPPP 

The meaning representation of the verb phrase constituent in this con¬ 
struction, ate dinner , is a ^-expression where the X variable represents the as 
yet unseen subject. 

Xx3e Isa(e , Eating) A Eater(e,x) A Eaten(e , Dinner) 

The representation of the prepositional phrase is also a ^-expression 
where the X variable is the second argument in the PP semantics. 

Xxln(x : < 3h Hurry(h) >) 

The correct representation for the modified verb phrase should contain 
the conjunction of these two representations with the Eating event variable 
filling the first argument slot of the In expression. In addition, this modified 
representation must remain a X-expression with the unbound Eater variable 
as the new ^-variable. The following attachment expression fulfills all of 
these requirements. 

VP —> VP PP {XyVP.sem(y) A PP.sem(VRsem.variable)} 

There are two aspects of this attachment that require some elabora¬ 
tion. The first involves the application of the constituent verb phrases’ X- 
expression to the variable y. Binding the lower ^-expression’s variable to 
a new variable allows us to lift the lower variable to the level of the newly 
created ^-expression. The result of this technique is a new ^-expression with 
a variable that, in effect, plays the same role as the original variable in the 
lower expression. In this case, this allows a ^-expression to be modified 
during the analysis process before the argument to the expression is actually 
available. 

The second new aspect in this attachment involves the VP.sem.variable 
notation. This notation is used to access the event-variable representing the 
underlying meaning of the verb phrase, in this case, e. This is analogous 
to the notation used to provide access the various parts of complex-terms 
introduced earlier. 



Section 15.3. Integrating Semantic Analysis into the Earley Parser 


567 


Applying this attachment to the current example yields the following 
representation, which is suitable for combination with a subsequent subject 
noun phrase. 

Xy3e Isa(e , Eating ) A Eater(e,y ) A Eaten(e , Dinner) 

A In(e,< 3hHurry(h ) >) 

Verb Argument Prepositional Phrases 

The prepositional phrases is this category serve to signal the role an argument 
plays in some larger event structure. As such, the preposition itself does not 
actually modify the meaning of the noun phrase. Consider the following 
example of role signaling prepositional phrases. 

(15.14) I need to go from Boston to Dallas. 

In examples like this, the arguments to go are expressed as a prepositional 
phrases. However, the meaning representations of these phrases should con¬ 
sist solely of the unaltered representation of their head nouns. To handle 
this, argument prepositional phrases are treated in the same way that non¬ 
branching grammatical rules are; the semantic attachment of the noun phrase 
is copied unchanged to the semantics of the larger phrase. 

PP -A PNP {NP. sent} 

The verb phrase can then assign this meaning representation to the appro¬ 
priate event role. A more complete account of how these argument bear¬ 
ing prepositional phrases map to underlying event roles will be presented in 
Chapter 16. 

15.3 Integrating Semantic Analysis into the 
Earley Parser 

In Section 15.1, we suggested a simple pipeline architecture for a semantic 
analyzer where the results of a complete syntactic parse are passed to a se¬ 
mantic analyzer. The motivation for this notion stems from the fact that the 
compositional approach requires the syntactic parse before it can proceed. It 
is, however, also possible to perform semantic analysis in parallel with syn¬ 
tactic processing. This is possible because in our compositional framework, 
the meaning representation for a constituent can be created as soon as all of 
its constituent parts are present. This section describes just such an approach 
to integrating semantic analysis into the Earley parser from Chapter 10. 
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The integration of semantic analysis into an Earley parser is straight¬ 
forward and follows precisely the same lines as the integration of unification 
into the algorithm given in Chapter 11. Three modifications are required to 
the original algorithm: 

• The rules of the grammar are given a new field to contain their semantic 
attachments. 

• The states in the chart are given a new field to hold the meaning repre¬ 
sentation of the constituent. 

• The Enqueue function is altered so that when a complete state is en¬ 
tered into the chart its semantics are computed and stored in the state’s 
semantic field. 


procedure Enqueue {state, chart-entry ) 
if Incomplete?^/*?) then 

if state is not already in chart-entry then 
P\JSH(state, chart-entry) 
else if Unify- STATE(state) succeeds then 
if Apply-Semantics {state) succeeds then 
if state is not already in chart-entry then 
Push {state, chart-entry) 

procedure Apply-Sem antics {state) 
meaning-rep «— Apply (state.semantic-attachment, state) 
if meaning-rep does not equal failure then 
state.meaning-rep «— meaning-rep 


Figure 15.5 The Enqueue function modified to handle semantics. If 
the state is complete and unification succeeds then Enqueue calls Apply- 
Semantics to compute and store the meaning representation of completed 
states. 


Figure 15.5 shows the Enqueue and functions modified to create 
meaning representations. When Enqueue is passed a complete state that 
can successfully unify its unification constraints it calls Apply-Semantics 
to compute and store the meaning representation for this state. Note the im¬ 
portance of performing feature-structure unification prior to semantic analy¬ 
sis. This ensures that semantic analysis will be performed only on valid trees 
and that features needed for semantic analysis will be present. 
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The primary advantage of this integrated approach over the pipeline 
approach lies in the fact that Apply-Semantics can fail in a manner similar 
to the way that unification can fail. If a semantic ill-formedness is found in 
the meaning representation being created, the corresponding state can be 
blocked from entering the chart. In this way, semantic considerations can be 
brought to bear during syntactic processing. Chapter 16 describes in some 
detail the various ways that this notion of ill-formedness can be realized. 

Unfortunately, this also illustrates one of the primary disadvantages of 
integrating semantics directly into the parser — considerable effort may be 
spent on the semantic analysis of orphan constituents that do not in the end 
contribute to a successful parse. The question of whether the gains made by 
bringing semantics to bear early in the process outweigh the costs involved 
in performing extraneous semantic processing can only be answered on a 
case by case basis. 

15.4 Idioms and Compositionality 

Ce corps qui s’appelait et qui s’appelle encore le saint empire 
romain n’etait en aucune maniere ni saint, ni romain, ni empire. 

This body, which called itself and still calls itself the Holy Roman 
Empire, was neither Holy, nor Roman, nor an Empire. 

-Voltaire 3 , 1756. 

As innocuous as it seems, the principle of compositionality runs into trouble 
fairly quickly when real language is examined. There are many cases where 
the meaning of a constituent is not based on the meaning of its parts, at least 
not in the straightforward compositional sense. Consider the following WSJ 
examples. 

(15.15) Coupons are just the tip of the iceberg. 

(15.16) The SEC’s allegations are only the tip of the iceberg. 

(15.17) Coronary bypass surgery, hip replacement and intensive-care units 
are but the tip of the iceberg. 

The phrase the tip of the iceberg in each of these examples clearly doesn’t 
have much to do with tips or icebergs. Instead, it roughly means something 

3 Essai sur les moeurs et les esprit des nations. Translation by Y. Sills, as quoted in (Sills 
and Merton, 1991). 
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like the beginning. The most straightforward way to handle idiomatic con¬ 
structions like these is to introduce new grammar rules specifically designed 
to handle them. These idiomatic rules mix lexical items with grammatical 
constituents, and introduce semantic content that is not derived from any of 
its parts. 

Consider the following rule as an example of this approach. 

NP the tip of the iceberg 
{Beginning} 

The lower case items on the right-hand side of this rule are intended 
to represent precisely words in the input. Although, the constant Beginning 
should not be taken too seriously as a meaning representation for this idiom, 
it does illustrate the idea that the meaning of this idiom is not based on 
the meaning of any of its parts. Note that an Earley-style analyzer with 
this rule will now produce two parses when this phrase is encountered: one 
representing the idiom and one representing the compositional meaning. 

Not surprisingly, as with the rest of the grammar, it may take a few tries 
to get to these rules right. Consider the following iceberg examples from the 
WSJ corpus. 

(15.18) And that’s but the tip of Mrs. Ford’s iceberg. 

(15.19) These comments describe only the tip of a 1,000-page iceberg. 

(15.20) The 10 employees represent the merest tip of the iceberg. 

The rule given above is clearly not general enough to handle these cases. 
These examples indicate that there is a vestigial syntactic structure to this 
phrase that at permits some variation in the determiners used and also per¬ 
mits some adjectival modification of both the iceberg and the tip. A more 
promising rule would be something along the following lines. 

NP TipNP of IcebergNP 
{Beginning} 

Here the categories TipNP and IcebergNP can be given an internal 
nominal-like structure that permits some adjectival modification and some 
variation in the determiners, while still restricting the heads of these noun 
phrases to the lexical items tip and iceberg. Note that this syntactic solution 
ignores the thorny issue that the modifiers mere and 1000-page seem to in¬ 
dicate that both the tip and iceberg may in fact play some compositional role 
in the meaning of the idiom. We will return to this topic in Chapter 16, when 
we take up the issue of metaphor. 

To summarize, handling idioms requires at least the following changes 
to the general compositional framework. 
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• Allow the mixing of lexical items with traditional grammatical con¬ 
stituents. 

• Allow the creation of additional idiom-specific constituents needed to 
handle the correct range of productivity of the idiom. 

• Permit semantic attachments that introduce logical terms and predi¬ 
cates that are not related to any of the constituents of the rule. 

This discussion is obviously only the tip of an enormous iceberg. Id¬ 
ioms are far more frequent and far more productive than is generally rec¬ 
ognized and pose serious difficulties for many applications, including as we 
will see in Chapter 21, machine translation. 

15.5 Robust Semantic Analysis 

As we noted earlier, when syntax-driven semantic analysis is is applied in 
practice, certain compromises have to be made to facilitate system develop¬ 
ment and efficiency of operation. The following sections describe the two 
primary ways of instantiating a syntax-driven approach in practical systems. 

Semantic Grammars 

When we first introduced Frege’s principle of compositionality in Section 
15.1, we noted that the parts referred to in that principle are the constituents 
provided by a syntactic grammar. Unfortunately, the syntactic structures 
provided by such grammars are often not particularly well-suited for the 
task of compositional semantic analysis. This is not particularly surpris¬ 
ing since capturing elegant syntactic generalizations and avoiding overgen¬ 
eration carry considerably more weight in the design of grammars than se¬ 
mantic sensibility does. This mismatch between the structures provided by 
traditional grammars and those needed for compositional semantic analysis 
typically manifests itself in the following three ways. 

• Key semantic elements are often widely distributed across parse trees, 
thus complicating the composition of the required meaning represen¬ 
tation. 

• Parse trees often contain many syntactically motivated constituents that 
play essentially no role in semantic processing. 

• The general nature of many syntactic constituents results in semantic 
attachments that create nearly vacuous meaning representations. 
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As an example of the first two problems, consider the parse tree shown 
in Figure 15.6 for the following berp example. 

(15.21) I want to go to eat some Italian food today. 

The branching structure of this tree distributes the key components of the 
meaning representation widely throughout the tree. At the same time, most 
of the nodes in the tree contribute almost nothing to the meaning of this 
sentence. This structure requires three lambda-expressions and a complex 
term to bring the few contentful elements together at the top of the tree. 

The third problem arises from the need to have uniform semantic at¬ 
tachments in the compositional rule-to-rule approach. This requirement of¬ 
ten results in constituents that are at the right level of generality for the syn¬ 
tax, but too high a level for semantic purposes. A good example of this is 
the case of nominal compounds and adjective phrases, where the semantic 
attachments are so general as to be nearly meaningless. Consider, for exam¬ 
ple, the rule governing the phrase Italian food in our current example. 

Nominal —> Adj Nominal 

{Xx Nominalsem(x) A AM(x, Adj.sem)} 
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Applying this attachment results in the following meaning representation. 

3x Isa(x, Food) AAM(xJtalian) 

All nominals that fit this pattern receive the same vague interpretation that 
roughly indicates that the nominal is modified by the adjective. This is a far 
cry from what know that expressions like Italian food and Italian restaurant 
mean; they denote food prepared in a particular way, and restaurants that 
serve food prepared that way. Unfortunately, there is no way to get this very 
general rule to produce such an interpretation. 

Both of these problems can be overcome through the use of semantic 
grammars, which were originally developed for text-based dialog systems 
in the domains of question-answering and intelligent tutoring (Brown and 
Burton, 1975). Semantic grammars that are more directly oriented towards 
serving the needs of a compositional analysis. In this approach, the rules 
and constituents of the grammar are designed to correspond directly to enti¬ 
ties and relations from the domain being discussed. More specifically, such 
grammars are constructed so that key semantic components can occur to¬ 
gether within single rules, and rules are made no more general than is needed 
to achieve sensible semantic analyses. 

Let’s consider how these two general strategies might be applied in the 
berp domain. Consider the following candidate rule for the particular kind 
of information request illustrated in Example 15.21. 

InfoRequest User want to go to eat FoodType TimeExpr 

As with the rules introduced for idioms, rules of this type freely mix non¬ 
terminals and terminals on their right-hand side. In this case, User, FoodType , 
and TimeExpr represent semantically motivated non-terminal categories for 
this domain. Given this, the semantic attachment for this rule would have all 
the information that it needs to compose the meaning representation for re¬ 
quests of this type from the immediate constituents of the rule. In particular, 
there is no need for ^-expressions, since this flat rule elevates all the relevant 
arguments to the top of the tree. 

Now consider the following rule that could be used to parse the the 
phrase Italian food in our example. 

FoodType —> Nationality FoodType 

The specific nature of this rule permits a far more useful semantic attachment 
than is possible with the generic nominal rule given above. More specifically, 
it can create a representation that states that the food specified by the con- 
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stituent FoodType is to prepared in the style associated with the Nationality 
constituent. 

One of the key motivations for the use of semantic grammars in these 
domains was the need to deal with various kinds of anaphor and ellipsis. Se¬ 
mantic grammars can help with these phenomena since by their nature they 
enable a certain amount of prediction. More specifically, they allow parsers 
to make highly specific predictions about upcoming input, based on the cat¬ 
egories being actively predicted by the parser. Given this ability, anaphoric 
references and missing elements can be associated with specific semantic 
categories. 

As an example of how this works consider the following atis exam¬ 
ples. 

(15.22) When does flight 573 arrive in Atlanta? 

(15.23) When does it arrive in Dallas? 

Sentences like these can be analyzed with a rule like the following, which 
makes use of the domain specific non-terminals Flight and City. 

InfoRequest when does Flight arrive in City 

A rule such as this gives far more information about the likely referent 
of the it , than a purely syntactic rule that would simply restrict it to anything 
expressible as a noun phrase. Operationally, such a system might search 
back in the dialog for places where the Flight constituent has been recently 
used to find candidate references for this pronoun. Chapter 18 discusses the 
topic of anaphor resolution in more detail. 

Not surprisingly, there are a number of drawbacks to basing a system 
on a semantic grammar. The primary drawback arises from an almost com- 
reuse plete lack of reuse in the approach. Combining the syntax and semantics of 
a domain into a single representation makes the resulting grammar specific 
to that domain. In contrast, systems that keep their syntax and semantics 
separate can, in principle, reuse their grammars in new domains. A second 
lack of reuse arises as a consequence of eschewing syntactic generalizations 
in the grammar. This results in an unavoidable growth in the size of the 
grammar for a single domain. As an example of this, consider that whereas 
our original noun phrase rule was sufficient to cover both Italian restaurant 
as well as Italian food , we now need two separate rules for these phrases. 
In fact, inspection of the berp corpus reveals that we would also need also 
need additional rules for vegetarian restaurant , California restaurant , and 
expensive restaurant. 
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We should also note that semantic grammars are susceptible to a kind 
of semantic overgeneration. As an example of this, consider the phrase 
Canadian restaurant. It matches the rule given above for ethnic restaurants, 
and would result in a meaning representation that specifies a restaurant that 
serves food prepared in the Canadian style. Unfortunately, this is almost 
certainly an incorrect interpretation of this phrase; none of the occurrences 
of this phrase in the WSJ corpus had this meaning, all referring instead to 
restaurants located within Canada. Dialog systems that use semantic gram¬ 
mars rely on the rarity of such uses in restricted domains. 

Finally, we should note that semantic grammars probably should have 
been called something else, since in practice the grammars themselves are 
formally the same as any other grammar formalism we have discussed in 
this book. Correspondingly, there are no special algorithms for syntactic 
or semantic analysis specific to semantic grammars; they can use whatever 
algorithms are appropriate for the grammar formalism being employed, such 
as Earley, or any other context-free parsing algorithm. 

Information Extraction 

In language processing tasks such question-answering, coming to a reason¬ 
able understanding of each input sentence is vital since giving a user a wrong 
answer can have serious consequences. For these tasks, the rule-to-rule ap¬ 
proach with an eye towards semantics is a good way to build a complete 
interpretation of an input sentence. 

However, other tasks, like extracting information about joint ventures 
from business news, understanding weather reports, or summarizing simple 
information about what happened today on the stock market from a radio 
report, do not necessarily require this kind of detailed understanding. Such 
information extraction tasks are characterized by two properties: (1) the eTt°raction N 
desired knowledge can be described by a relatively simple and fixed tem¬ 
plate, or frame, with slots that need to be filled in with material from the template 
text, and (2) only a small part of the information in the text is relevant for 
filling in this frame; the rest can be ignored. 

For example, one of the tasks used in the fifth Message Understand¬ 
ing Conference (MUC-5) in 1993 (Sundheim, 1993), a U.S. Government- 
organized information extraction conference, was to extract information about 
international joint ventures from business news. Here are the first two sen¬ 
tences of a sample article from (Grishman and Sundheim, 1995): 

Bridgestone Sports Co. said Friday it has set up a joint venture in Tai- 
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Methodology Box: Evaluating Information Ex¬ 
traction Systems 


The information extraction paradigm has much in common with 
the field of information retrieval and has adapted several standard 
evaluation metrics front information retrieval including precision, 
recall, fallout, and a combined metric called an F-measure. 

Recall is a measure of how much relevant information the sys¬ 
tem has extracted from the text; it is thus a measure of the coverage 
of the system. Recall is defined as follows: 

„ ~ ___ # of correct answers given b y system 

eca ' — tofar#o'fpossible correct answers in the text 


Precision is a measure of how much of the information that the sys¬ 
tem returned is actually correct, and is also known as accuracy. Pre¬ 
cision is defined as follows: 


Precision: 


# of correct answers given by system 
# of answers given by system 


Fallout is a measure of the systems ability to ignore spurious infor¬ 
mation in die text. It is defined as follows: 


Fallout: — 


# of incorrect answers given by system 
# of spurious facts in the text 


Note that recall and precision are antagonistic to one another 
since a conservative system that strives for perfection in terms of 
precision will invariably lower its recall score. Similarly, a system 
that strives for coverage will get more things wrong, thus lowering 
its precision score. This situation has led to the use of a combined 
measure called the F-measure that balances recall and precision by 
using a parameter p. The F-measure is defined as follows: 

F (P 2 +QP/? 

When $ is one, precision and recall are given equal weight. When 
is greater than one, precision is favored, and when p is less than one, 
recall is favored. 
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TIE-UP-1: 


Relationship: 

TIE-UP 

Entities: 

“Bridgestone Sports Co.” 


“a local concern” 


“a Japanese trading house” 

Joint Venture Company 

“Bridgestone Sports Taiwan Co.” 

Activity 

ACTIVITY-1 

Amount 

NT$20000000 

ACTIVITY-1: 


Company 

“Bridgestone Sports Taiwan Co.” 

Product 

“iron and “metal wood” clubs” 

Start Date 

DURING: January 1990 

Figure 15.7 The templates produced by the fastus (Hobbs et al , 1997) 

information extraction engine given the input text on page 575. 


wan with a local concern and a Japanese trading house to produce golf 
clubs to be shipped to Japan. 

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 
million new Taiwan dollars, will start production in January 1990 with 
production of 20,000 iron and “metal wood” clubs a month. 

The output of an information extraction system can be a single template 
with a certain number of slots filled in, or a more complex hierarchically re¬ 
lated set of objects. The MUC-5 task specified this latter more complex out¬ 
put, requiring systems to produce hierarchically linked templates describing 
the participants in the joint venture, the resulting company, and its intended 
activity, ownership and capitalization. Figure 15.7 shows the resulting struc¬ 
ture produced by the fastus system (Hobbs et al , 1997). 

Many information extraction systems are built around cascades of finite- cascades 
state automata. The fastus system, for example, produces the template 
given above, based on a cascade in which each level of linguistic process¬ 
ing extracts some information from the text, which is passed on to the next 
higher level, as shown in Figure 15.8 

Many systems base all or most of these levels on finite-automata, al¬ 
though in practice most complete systems are not technically finite-state, 
either because the individual automata are augmented with feature registers 
(as in FASTUS), or because they are used only as preprocessing steps for full 
parsers (e.g. Gaizauskas et al, 1995; Weischedel, 1995) indexGaizauskas, 

R.), or are combined with other components based on decision-trees (Fisher 
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No. 

Step 

Description 

1 

Tokens: 

Transfer an input stream of characters 
into a token sequence. 

2 

Complex Words: 

Recognize multi-word phrases, numbers, 
and proper names. 

3 

Basic phrases: 

Segment sentences into noun groups, 
verb groups, and particles. 

4 

Complex phrases: 

Identify complex noun groups and com¬ 
plex verb groups. 

5 

Semantic Patterns: 

Identify semantic entities and events and 
insert into templates. 

6 

Merging: 

Merge references to the same entity or 
event from different parts of the text. 

Figure 15.8 Levels of processing in FASTUS(Hobbs et al ., 1997). Each 
level extracts a specific type of information which is then passed on to the next 
higher level. 


etal ., 1995). 

Let’s sketch the fastus implementation of each of these levels, fol¬ 
lowing Hobbs et al. (1997) and Appelt et al. (1995). After tokenization, the 
second level recognizes multiwords like set up , and joint venture , and names 
like Bridgestone Sports Co .. The name recognizer is a transducer, composed 
of a large set of specific mappings designed to handle locations, personal 
names, and names of organizations, companies, unions, performing groups, 
etc. The following are typical rules for modeling names of performing or¬ 
ganizations like San Francisco Symphony Orchestra and Canadian Opera 
Company. While the rules are written using a context-free syntax, there is 
no recursion and therefore they can be automatically compiled into finite- 
state transducers: 

Performer-Org (pre-location) Performer-Noun-i- Perf-Org-Suffix 

pre-location —y locname | nationality 

locname —> city | region 

Perf-Org-Suffix orchestra, company 

Performer-Noun symphony, opera 

nationality —> Canadian, American, Mexican 

city San Francisco, London 


The second stage also might transduce sequences like forty two into 
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the appropriate numeric value (recall the discussion of this problem on page 
124 in Chapter 5). 

The third fastus stage produces a series of basic phrases, such as 
noun groups, verb groups, etc., using finite-state rules of the sort shown on 
page 386. The output of the fastus basic phrase identifier is shown in Fig¬ 
ure 15.9; note the use of some domain-specific basic phrases like Company 
and Location . 


Company 

Bridgestone Sports Co. 

Verb Group 

said 

Noun Group 

Friday 

Noun Group 

it 

Verb Group 

had set up 

Noun Group 

a joint venture 

Preposition 

in 

Location 

Taiwan 

Preposition 

with 

Noun Group 

a local concern 

Conjunction 

and 

Noun Group 

a Japanese trading house 

Verb Group 

to produce 

Noun Group 

golf clubs 

Verb Group 

to be shipped 

Preposition 

to 

Location 

Japan 


Figure 15.9 The output of Stage 2 of the FASTUS basic-phrase extractor, 
which uses finite-state rules of the sort described by Appelt and Israel (1997) 
and shown on page 386. 


Recall that Chapter 10 described how these basic phrases can be com¬ 
bined into complex noun groups and verb groups. This is accomplished in 
Stage 4 of fastus, by dealing with conjunction and with the attachment of 
measure phrases as in the following. 

20,000 iron and ’’metal wood” clubs a month, 
and preposition phrases: 

production of 20,000 iron and ’’metal wood” clubs a month, 

The output of Stage 4 is a list of complex noun groups and verb groups. 
Stage 5 takes this list, ignoring all input that has not been chunked into a 
complex group, recognizes entities and events in the complex groups, and 
inserts the recognized objects into the proper templates. The recognition of 


BASIC 

PHRASES 
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(1) 

Relationship: 

Entities: 

TIE-UP 

“Bridgestone Sports Co.” 

“a local concern” 

“a Japanese trading house” 

(2) 

Activity 

PRODUCTION 


Product 

“golf clubs” 

(3) 

Relationship: 

TIE-UP 


Joint Venture Company 

“Bridgestone Sports Taiwan Co.” 


Amount 

NT$20000000 

(4) 

Activity 

PRODUCTION 


Company 

“Bridgestone Sports Taiwan Co.” 


Start Date 

DURING: January 1990 

(5) 

Activity 

PRODUCTION 


Product 

“iron and “metal wood” clubs” 

Figure 15.10 The five partial templates produced by Stage 5 of the fastus 

system. These templates will be merged by the Stage 6 Merging algorithm to 

produce the final template shown in Figure 15.7 on page 577. 


entities and events is done by hand-coded finite-state automata whose tran¬ 
sitions are based on particular complex-phrase types annotated by particular 
head words or particular features like company , currency , or date . 

For example, the first sentence of the news story above realizes the 
semantic patterns based on the following two regular expressions (where 
NG indicates Noun-Group and YG Verb-Group). 

• NG(Company/ies) YG(Set-up) NG( Joint-Venture) with NG(Company/ies) 

• VG(Produce)NG(Product) 

The second sentence realizes the second pattern above as well as the follow¬ 
ing two patterns: 

• NG(Company) VG-Passive(Capitalized) at NG(Currency) 

• NG(Company) YG(Start) NG(Activity) in/on NG(Date) 

The result of processing these two sentences is the set of five draft tem¬ 
plates shown in Figure 15.10. These five templates must then be merged into 
the single hierarchical structure shown in Figure 15.7. The merging algo¬ 
rithm decides whether two activity or relationship structures are sufficiently 
consistent that they might be describing the same events, and merges them 
if so. Since the merging algorithm must perform reference resolution (de¬ 
ciding when it is the case that two descriptions refer to the same entity), we 
defer description of this level to Chapter 18. 
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Domain-specific templates of the kind we have described in this sec¬ 
tion have also been used in many limited-domain semantic understanding 
and discourse comprehension tasks, including managing mixed dialog in 
question-answering systems (Bobrow etal , 1977). 


15.6 Summary 

This chapter explores the notion of syntax-driven semantic analysis. Among 
the highlights of this chapter are the following topics. 

• Semantic analysis is the process whereby meaning representations are 
created and assigned to linguistic inputs. 

• Semantic analyzers that make use of static knowledge from the lexicon 
and grammar can create context independent literal, or conventional, 
meanings. 

• The Principle ofCompositionality states that the meaning of a sentence 
can be composed from the meanings of its parts. 

• In syntax-driven semantic analysis, the parts are the syntactic con¬ 
stituents on an input. 

• Compositional creation of fopc formulas is possible with a few nota- 
tional extensions including ^-expressions and complex terms. 

• Natural language quantifiers introduce a kind of ambiguity that is dif¬ 
ficult to handle compositionally. Complex terms can be used to com¬ 
pactly encode this ambiguity. 

• Idiomatic language defies the principle of compositionality but can eas¬ 
ily be handled by adapting the techniques used to design grammar rules 
and their semantic attachments. 

• Practical semantic analysis systems adapt the strictly compositional 
approach in a number of ways. 

- Dialog systems based on semantic grammars rely on grammars 
that have been written to serve the needs of semantics rather than 
syntactic generality. 

- Information extraction systems based on cascaded automata can 
extract pertinent information while ignoring irrelevant parts of the 
input. 
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Bibliographical and Historical Notes 


As noted earlier, the principle of compositionality is traditionally attributed 
to Frege; Janssen (1997) discusses this attribution. Using the categorial 
grammar framework described in Chapter 12, Montague (1973) demonstrated 
that a compositional approach could be systematically applied to an inter¬ 
esting fragment of natural language. The rule-to-rule hypothesis was first 
articulated by (Bach, 1976). On the computational side of things, Woods’s 
Lunar system (Woods, 1977) was based on a pipelined syntax-first com¬ 
positional analysis. Schubert and Pelletier (1982) developed an incremental 
rule-to-rule system based on Gazdar’s GPSG approach (Gazdar, 1981, 1982; 
Gazdar et al , 1985). Main and Benson (1983) extended Montague’s ap¬ 
proach to the domain of question-answering. 

In one of the all too frequent cases of parallel development, researchers 
in programming languages developed essentially identical compositional tech¬ 
niques to aid in the design of compilers. Specifically, Knuth (1968) intro¬ 
duced the notion of attribute grammars that associate semantic structures 
with syntactic structures in a one to one correspondence. As a consequence, 
the style of semantic attachments used in this chapter will be familiar to users 
of the YACC-style (Johnson and Lesk, 1978) compiler tools. 

Semantic Grammars are due to Burton (Brown and Burton, 1975). 
Similar notions developed around the same time included Pragmatic Gram¬ 
mars (Woods, 1977), and Performance Grammars (Robinson, 1975). All 
centered around the notion of reshaping syntactic grammars to serve the 
needs of semantic processing. It is safe to say that most modem systems 
developed for use in limited domains make use of some form of semantic 
grammar. 

Most of the techniques used in the fragment of English presented in 
Section 15.2 are adapted from SRI’s Core Language Engine (Alshawi, 1992). 
Additional bits and pieces were adapted from (Woods, 1977; Schubert and 
Pelletier, 1982; Gazdar et al , 1985). Of necessity, a large number of im¬ 
portant topics were not covered in this chapter. See (Alshawi, 1992) for 
the standard gap-threading approach to semantic interpretation in the pres¬ 
ence of long-distance dependencies, ter Meulen (1995) presents an up to 
date treatment of tense, aspect, and the representation of temporal informa¬ 
tion. Extensive coverage of approaches to quantifier scoping can be found 
in (Hobbs and Shieber, 1987; Alshawi, 1992). van Lehn (1978) presents a 
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set of human preferences for quantifier scoping. Over the years, a consider¬ 
able amount of effort has been directed toward the interpretation of nominal 
compounds. Linguistic research on this topic can be found in (Lees, 1970; 
Downing, 1977; Levi, 1978; Ryder, 1994), more computational approaches 
are described in (Gershman, 1977; Finin, 1980; McDonald, 1982; Pierre, 
1984; Arens et al, 1987; Wu, 1992; Vanderwende, 1994; Lauer, 1995). 

There is a long and extensive literature on idioms. Fillmore et al 
(1988) describe a general grammatical framework that places idioms at the 
the center of its underlying theory. Makkai (1972) presents an extensive 
linguistic analysis of many English idioms. Hundreds of idiom dictionar¬ 
ies for second language learners are also available. On the computational 
side, Becker (1975) was among the first to suggest the use of phrasal rules 
in parsers. Wilensky and Arens (1980) were among the first to successfully 
make use of this notion. Zemik (1987) demonstrated a system that could 
learn such phrasal idioms in context. A collection of papers on computa¬ 
tional approaches to idioms appeared in (Fass et al , 1992). 

The first work on information extraction was performed in the context 
of the Frump system (DeJong, 1982). Later work was stimulated by the 
U.S government sponsored MUC conferences (Sundheim, 1991,1992,1993, 
1995b). Chinchor et al (1993) describes the evaluation techniques used in 
the MUC-3 and MUC-4 conferences. Hobbs (1997) partially credits the 
inspiration for fastus to the success of the University of Massachusetts 
circus system (Lehnert et al , 1991) in MUC-3. The SCISOR system is 
another system based loosely on cascades and semantic expectations that 
did well in MUC-3 (Jacobs and Rau, 1990). Due to the lack of reuse from 
one domain to another in information extraction, a considerable amount of 
work has focused on automating the process of knowledge acquisition in this 
area. A variety of supervised learning approaches are described in (Cardie, 
1993, 1994; Riloff, 1993; Soderland et al, 1995; Huffman, 1996; Freitag, 
1998). 

Finally, we have skipped an entire branch of semantic analysis in which 
expectations driven from deep meaning representations drive the analysis 
process. Such systems avoid the direct representation and use of syntax, 
rarely making use of anything resembling a parse tree. The earliest and most 
successful efforts along these lines were developed by Simmons (1973b, 
1978, 1983) and (Wilks, 1975a, 1975c). A series of similar approaches were 
developed by Roger Schank and his students (Riesbeck, 1975; Bimbaum and 
Selfridge, 1981; Riesbeck, 1986). In these approaches, the semantic analysis 
process is guided by detailed procedures associated with individual lexical 
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items. The CIRCUS information extraction system (Lehnert et al , 1991) 
traces its roots to these systems. 


Exercises 

15.1 The attachment given on page 560 to handle noun phrases with com- 
plex determiners is not general enough to handle most possessive noun phrases. 
Specifically, it doesn’t work for phrases like the following. 

a. My sister’s flight 

b. My fiance’s mother’s flight 

Create a new set of semantic attachments to handle cases like these. 

15.2 Develop a set of grammar rules and semantic attachments to handle 
predicate adjectives such as the one following. 

a. Flight 308 from New York is expensive. 

b. Murphy’s restaurant is cheap. 

15.3 None of the attachments given in this chapter provide temporal infor¬ 
mation. Augment a small number of the most basic rules to add temporal 
information along the lines sketched in Chapter 14. Use your rules to create 
meaning representations for the following examples. 

a. Flight 299 departed at 9 o’clock. 

b. Flight 208 will arrive at 3 o’clock. 

c. Flight 1405 will arrive late. 

15.4 As noted in Chapter 14, the present tense in English can be used to 
refer to either the present or the future. However, it can also be used to 
express habitual behavior, as in the following. 

Flight 208 leaves at 3 o’clock. 

This could be a simple statement about today’s Flight 208, or alterna¬ 
tively it might state that this flight leaves at 3 o’clock every day. Create a 
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fopc meaning representation along with appropriate semantic attachments 
for this habitual sense. 

15.5 Implement the Earley-based semantic analyzer described in Section 
15.3. 

15.6 It has been claimed that it is not necessary to explicitly list the seman¬ 
tic attachment for most grammar rules. Instead, the semantic attachment for 
a rule should be inferable from the semantic types of the rule’s constituents. 
For example, if a rule has two constituents where one is a single argument 
^-expression and the other is a constant then the semantic attachment should 
obviously apply the ^-expression to the constant. Given the attachments pre¬ 
sented in this chapter, does this type-driven semantics seem like a reasonable 
idea? 

15.7 Add a simple type-driven semantics mechanism to the Earley analyzer 
you implemented for Exercise 15.5 

15.8 Using a phrasal search on your favorite Web search engine, collect a 
small corpus of the tip of the iceberg examples. Be certain that you search 
for an appropriate range of examples (ie. don’ just search of “the tip of the 
iceberg”.) Analyze these examples and come up with a set of grammar rules 
that correctly accounts for them. 

15.9 Collect a similar corpus of examples for the idiom miss the boat . An¬ 
alyze these examples and come up with a set of grammar rules that correctly 
accounts for them. 

15.10 There are now a fair number of Web-based natural language question 
answering services that purport to provide answers to questions on a wide 
range of topics (see this book’s Web page for pointers to current services.) 
Develop a corpus of questions for some general domain of interest and use 
it to evaluate one or more of these services. Report your results. What 
difficulties did you encounter in applying the standard evaluation techniques 
to this task? 

15.11 Collect a small corpus of weather reports from your local newspaper 
or the Web. Based on an analysis of this corpus, create a set of frames 
sufficient to capture the semantic content of these reports. 

15.12 Implement and evaluate a small information extraction system for 
the weather report corpus you collected for the last exercise. 





LEXICAL SEMANTICS 


‘When I use a word* Humpty Dumpty said in rather a scornful 
tone, ‘it means just what I choose it to mean - neither more nor 
less / 


Lewis Carrol’s Alice in Wonderland 

How many legs does a dog have if you call its tail a leg? 
Four. 

Calling a tail a leg doesn't make it one. 

Attributed to Abraham Lincoln 


A revised version of this chapter will be available shortly. 

The previous two chapters focused on representing and creating mean¬ 
ing representations for entire sentences. In those discussions, we made min¬ 
imal use of the notion of the meaning of a word. Words and their meanings 
were of interest solely to the extent that they provided the appropriate bits 
and pieces necessary to construct adequate meaning representations for en¬ 
tire sentences. This general approach is motivated by the view that while 
words may contribute content to the meanings of sentences, they do not 
themselves have meanings. By this we mean that words, by themselves, 
do not refer to the world, can not be judged to be true or false, or literal 
or figurative, or a host of other things that are generally reserved to entire 
sentences and utterances. This narrow conception of the role of words in a 
semantic theory leads to a view of the lexicon as a simple listing of symbolic 
fragments devoid of any systematic structure. 
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The topics presented in this chapter serve to illustrate how much is 
missed by this narrow view. As we will see, the lexicon has a highly system¬ 
atic structure that governs what words can mean, and how they can be used. 
This structure consists of relations among words and their meanings, as well 
as the internal structure of individual words. The study of this systematic, 
semantics meaning related, structure is called Lexical Semantics. 

Before moving on, we will first introduce a few new terms, since the 
ones we have been using thus far are entirely too vague. In particular, the 
word word has by now been used in so many different ways that it will 
prove difficult to make unambiguous use of it in this chapter. Instead, we 
lexeme will focus on the notion of a lexeme, an individual entry in the lexicon. 

A lexeme should be thought of as a pairing of a particular orthographic and 
phonological form with some form of symbolic meaning representation. The 
lexicon is therefore a finite list made up of lexemes. When appropriate, we 
will use the terms orthographic form, and phonological form, to refer to the 
sense appropriate form part of this pairing, and the term sense to refer to a lexeme’s 
meaning component. Note that these definitions will undergo a number of 
refinements as needed in later sections. 

Given this minimal nomenclature, let us return to the topic of what 
facts we can discover about lexemes that are relevant to the topic of meaning. 
A fruitful place to start such an exploration is a dictionary. Dictionaries are, 
after all, nothing if not repositories of information about the meanings of 
lexemes. Within dictionaries, it turns out that the most interesting place to 
look first is at the definitions of lexemes that no one ever actually looks up. 
For example, consider the following fragments from the definitions of right , 
left, red , blood from the American Heritage Dictionary( Morris, 1985). 

right ad] located nearer the right hand esp. being on the right when facing 
the same direction as the observer, 
left ad] located nearer to this side of the body than the right, 
red n the color of blood or a ruby. 

blood n the red liquid that circulates in the heart, arteries and veins of animals. 

The first thing to note about these definitions is the surprising amount 
of circularity in them. The definition of right makes two direct references to 
itself, while the entry for left contains an implicit self-reference in the phrase 
this side of the body , which presumably means the left side. The entries for 
red and blood avoid this kind of direct self-reference by instead referencing 
each other in their definitions. Such circularity is, of course, inherent in all 
dictionary definitions, these examples are just extreme cases. In the end, all 
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definitions are stated in terms of lexemes that are, in turn, defined in terms 
of other lexemes. 

From a purely formal point of view, this inherent circularity is evidence 
that what dictionaries entries provide are not, in fact, definitions at all. They 
are simply descriptions of lexemes in terms of other lexemes, with the hope 
being that the user of the dictionary has sufficient grasp of these other terms 
to make the entry in question sensible. As is obvious with lexemes like red 
and right , this approach will fail without some ultimate grounding in the 
external world. 

Fortunately, even with this limitation, there is still a wealth of semantic 
information contained in these kinds of definitions. For example, the above 
definitions make it clear that right and left are similar kinds of lexemes that 
stand in some kind of alternation, or opposition, to one another. Similarly, 
we can glean that red is a color, it can be applied to both blood and rubies , 
and that blood is a liquid. As we will see in this chapter, given a sufficiently 
large database of facts such as these, many applications are quite capable 
of performing sophisticated semantic tasks (even if they do not really know 
their right from their left.) 

To summarize, we can capture quite a bit about the semantics of in¬ 
dividual lexemes by analyzing and labeling their relations to other lexemes 
in various settings. We will, in particular, be interested in accounting for 
the similarities and differences among different lexemes in similar settings, 
and the nature of the relations among lexemes in a single setting. This lat¬ 
ter topic will lead us to examine the idea that lexemes are not unanalyzable 
atomic symbols, but rather have an internal structure that governs their com¬ 
binatoric possibilities. Later, in Section 16.4, we will take a closer look at 
the notion of creativity, or generativity, and the lexicon. There we will ex¬ 
plore the notion that the lexicon should not be thought of as a finite listing, 
but rather as a creative generator of infinite meanings. 

Before proceeding, we should note that the view of lexical seman¬ 
tics presented here is not oriented solely towards improving computational 
applications of the more restrictive “only sentences have meaning” variety. 
Rather, as we will see, it lends itself to a wide array of applications that in¬ 
volve the use of words, and that could can be improved by some knowledge 
of their meanings. 
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16.1 Relations Among Lexemes and Their Senses 

The section explores a variety of relations that hold among lexemes and 
among their senses. The list of relations presented here is by no means 
exhaustive; the emphasis is on those relations that have had significant com¬ 
putational implications. As we will see, the primary analytic tool we will 
use involves the systematic substitution of one lexeme for another in some 
setting. The results of such substitutions can reveal the presence or absence 
of a specific relationship between the substituted lexemes. 

Homonymy 

homonymy We begin this section with a discussion of homonymy, perhaps the sim¬ 
plest, and semantically least interesting, relation to hold between lexemes. 
Traditionally, homonymy is defined as a relation that holds between words 
that have the same form with unrelated meanings. The items taking part in 
homonyms such a relation are called homonyms. A classic example of homonymy is 
bank with its distinct financial institution and sloping mound meanings, as 
illustrated in the following WSJ examples. 

(16.1) Instead, a bank can hold the investments in a custodial account in the 
client’s name. 

(16.2) But as agriculture burgeons on the east bank , the river will shrink 
even more. 

Loosely following lexicographic tradition, we will denote this relationship 
by placing a superscript on the orthographic form of the word as in bank 1 
and bank 2 . This notation indicates that these are two separate lexemes, with 
distinct and unrelated meanings, that happen to share an orthographic form. 

It will come as little surprise that any definition this simple will prove 
to be problematic and will need to be refined. In the following discussion, 
we will explore this definition by examining pairs of words that satisfy it, 
but which for a number of reasons seem to be marginal examples. We will 
begin by focusing solely on issues of form, returning later to the topic of 
meaning Note that while this may seem like an odd choice given the topic of 
this chapter, these discussions will serve to introduce a number of important 
distinctions needed in later sections. In this discussion, we will be primarily 
concerned with how well our definition of homonymy assists us in identify¬ 
ing and characterizing those lexemes which will lead to ambiguity problems 
for various applications. 
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HOMO¬ 

PHONES 


Returning to the bank example, the first thing to note is that bank 1 and 
bank 2 are identical in both their orthographic and phonological forms. Of 
course, there are also pairs of lexemes with distinct meanings which do not 
share both forms. For example, pairs like wood and would , and be and bee , 
are pronounced the same but are spelled differently. Indeed, as we saw in 
Chapter 5, when pronunciation in context is taken into account, the situation 
is even worse. Recall, that the lexemes knee , need , neat , new , you , the , and 
to can all be pronounced as [ni], given the right context. Clearly, if the notion 
of form in our definition of homonymy includes a word’s phonological form 
in context, there will are be a huge number of homonyms in English. 

Of course, none of these examples are traditionally be considered good 
candidates for homonymy. The notion of homonymy is most closely asso¬ 
ciated with the field of lexicography, where normally only dictionary en¬ 
tries with identical citation-forms are considered candidates for homonymy. forms n 
Citation-forms are the orthographic-forms that are used to alphabetically in¬ 
dex words in a dictionary, which in English correspond to what we have been 
calling the root form of a word. Under this view, words with the same pro¬ 
nunciation but different spellings are not considered homonyms, but rather 
homophones, distinct lexemes with a shared pronunciation. 

Of course, there are also pairs of lexemes with identical orthographic 
forms with different pronunciations. Consider, for example, the distinct fish 
and music meanings associated with the orthographic form bass in the fol¬ 
lowing examples. 

(16.3) The expert angler from Dora, Mo., was fly-casting for bass rather 
than the traditional trout. 

(16.4) The curtain rises to the sound of angry dogs baying and ominous 
bass chords sounding. 

While these examples more closely fit the traditional definition of homonymy, 
they would only rarely appear in any traditional list of homonyms. Instead, 
lexemes with the same orthographic form with unrelated meanings are called 

homographs. 

Finally, we should note that lexemes with different parts of speech are 
also typically not considered to be good candidates for homonymy. This 
restriction serves to rule out examples such as would and wood , on grounds 
other than their orthography. The basis for this restriction is two-fold: first 
as we saw when we discussed part-of-speech tagging, lexemes with such 
different parts of speech are easily distinguished based on their differing 
syntactic environments, and secondly lexical items can take on many distinct 


HOMO¬ 
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forms based on their inflectional and derivational morphology, which is in 
turn largely based on part-of-speech. 

To complicate matters, the issue of differing morphology can also oc¬ 
cur with lexemes that have the same part-of-speech. Consider the lexemes 
find and found in their locating and creating an institution meanings, as il¬ 
lustrated in the following WSJ examples. 

(16.5) He has looked at 14 baseball and football stadiums and found that 
only one - - private Dodger Stadium - brought more money into a city 
than it took out. 

(16.6) Culturally speaking, this city has increasingly displayed its 
determination to found the sort of institutions that attract the esteem 
of Eastern urbanites. 

Here we have two lexemes with distinct root forms, find and found , that 
nevertheless share the morphological variant found as the past tense of the 
first, and the root of the second. 

At this point, having raised all of these complexities, we might cre¬ 
ate a more refined definition for homonymy as two lexemes with unrelated 
meanings, the same part of speech, and identical orthographic and phonolog¬ 
ical forms in all possible morphological derivations. Under this definition, 
all homonyms would also be both homographs and homophones, with the 
converse not necessarily being the case. Under this new definition, most of 
the homographs and homophones presented earlier would be ruled out as 
homonyms. 

Such definitional exercises, however, merely obscure our reason for 
raising the issue of homonymy in the first place; homonymy is of interest 
computationally to the extent that it leads an application into dealing with 
ambiguity. Whether or not a given pair of lexemes cause ambiguity to arise 
in an application is entirely dependent on the nature of the application. As we 
will see in the following discussion of various applications, distinguishing 
perfect examples of homonymy from imperfect examples is of very little 
practical value. The critical issue is whether the nature of the form overlap 
is likely to cause difficulties for a given application. 

In spelling correction, homophones can lead to real-word spelling er¬ 
rors, or malapropisms, as when lexemes such as weather and whether are 
interchanged. Note that this is a case where a phonological overlap causes a 
problem for a purely text-based system. Additional problems in spelling cor¬ 
rection are caused by such imperfect homographs as find and found , which 
have partially overlapping morphologies. In this case, a word-form like 
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founded may represent a correct use of the past tense, or an incorrect over¬ 
application of the regular past tense rule to an irregular verb. 

In speech recognition, homophones such as to , two and too cause ob¬ 
vious problems. What is less clear, however, is that perfect homonyms such 
as bank are also problematic. Recall that speech recognition systems rely 
on language models that are often based on tables of N-gram probabilities. 

For perfect homonyms, the entries for all the distinct lexemes are conflated 
despite the fact that the different lexemes occur in different environments. 

This conflation results in inappropriately high probabilities to words that are 
cohorts of the lexeme not in use, and lower than appropriate probabilities to 
the correct cohorts. 

Finally, text-to-speech systems are vulnerable to homographs with 
distinct pronunciations. This problem can be avoided to some extent with ex¬ 
amples such as conduct whose different pronunciations are associated with 
the distinct parts of speech through the use of part-of-speech tagging, How¬ 
ever, for other examples like bass the two lexemes must be distinguished 
by some other means. Note that this situation is the reverse of the one we 
had with spelling correction, here a fundamentally speech-oriented system 
is being plagued by an orthographic problem. 

Polysemy 

Having muddied the waters discussing issues of form and homonymy, let 
us return to the topic of what it means for two meanings to be related or 
unrelated. Recall that the definition of homonymy requires that the lexemes 
in question have distinct and unrelated meanings. This is the crux of the 
matter; if the meanings in question are related in some way then we are 
dealing with a single lexeme with more than one meaning, rather than two 
separate lexemes. This phenomenon of a single lexeme with multiple related 
meanings is known as polysemy. Note that earlier we had defined a lexeme polysemy 
as a pairing between a surface form and a sense. Here we will expand that 
notion to be a pairing of a form with a set of related senses. 

To make this notion more concrete, consider the following bank exam¬ 
ple from the WSJ corpus. 

(16.7) While some banks furnish sperm only to married women, others are 

much less restrictive. 

Although this is clearly not a use of the sloping mound meaning of bank , 
it just as clearly is not a reference to a promotional giveaway at a financial 
institution. One way to deal with this use would be to create bank 3 , yet 



594 


Chapter 16. Lexical Semantics 


another distinct lexeme associated with the form bank , and give it a mean¬ 
ing appropriate to this use. Unfortunately, according to our definition of 
homonymy, this would require us to say that the meaning of bank in this ex¬ 
ample is distinct and unrelated to the financial institution sense, which seems 
to be far too strong a statement. The notion of polysemy allows us to state 
that this sense of bank is related to, and possibly derived from, the financial 
institution sense, without asserting that it is a distinct lexeme. 

As one might suspect, the task of distinguishing homonyny from pol¬ 
ysemy is not quite as straightforward as we made it seem with these bank 
examples. There are two criteria that are typically invoked to determine 
whether or not the meanings of two lexemes are related or not: the history, 
etymology or etymology, of the lexemes in question, and how the words are conceived 
of by native speakers. In practice, an ill-defined combination of evidence 
from these two sources is used to distinguish homonymous from polysemous 
lexical entries. In the case of bank , the etymology reveals that bank 1 has an 
Italian origin, while bank 2 is of Scandinavian origin, thus encouraging us to 
list them as distinct lexemes. On the other hand, our belief that the use of 
bank in Example 16.7 is related to bank 1 is based on introspection about the 
similarities of their meanings, and the lack of any etymological evidence for 
an independent third sense. 

In the absence of detailed etymological evidence, a useful intuition to 
use in distinguishing homonymy from polysemy is the notion of coincidence. 
Cases of homonymy can usually be understood easily as accidents of history 
- two lexemes which have coincidentally come to share the same form. On 
the other hand, it is far more difficult to accept cases of polysemy as coinci¬ 
dences. Returning again to our bank example, it is difficult to accept the idea 
that the various uses of bank in all of its various repository senses are only 
coincidentally related to the savings institution sense. 

Once we have determined that we are dealing with a polysemous lex¬ 
eme, we are of course still left with the task of managing the potentially 
numerous polysemous senses associated with it. In particular, for any given 
single lexeme we would like to be able to answer the following questions. 

• What distinct senses are there? 

• How are these senses related? 

• How can they be reliably distinguished? 

The answers to these questions can have serious consequences for well how 
semantic analyzers, search engines, generators, and machine translation sys¬ 
tems perform their respective tasks. The first two questions will be covered 
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here and in Section 16.4, while the final question will be covered in depth in 
Chapter 17. 

The issue of deciding how many distinct senses should be associated 
with a given polysemous lexeme is a task that has long vexed lexicographers, 
who until recently have been the only people engaged in the creation of large 
lexical databases. Most lexicographers take the approach of creating entries 
with as many senses as necessary to account for all the fine distinctions in 
meaning observed in some very large corpus of examples. This is a reason¬ 
able approach given that the primary use for a traditional dictionary is to 
assist users in learning the various uses of a word. Unfortunately, it tends to 
err on the side of making more distinctions than are normally required for 
any reasonable computational application. 

To make this notion of distinguishing distinct senses more concrete, 
consider the following uses of the verb serve from the WSJ corpus. 

(16.8) They rarely serve red meat, preferring to prepare seafood, poultry or 
game birds. 

(16.9) He served as U.S. ambassador to Norway in 1976 and 1977. 

(16.10) He might have served his time, come out and led an upstanding life. 

Reasonable arguments can be made that each of these examples rep¬ 
resents a distinct sense of serve. For example, the implicit contrast be¬ 
tween serving red meat and preparing seafood in the first example indicates 
a strong connection between this sense of serve and the related notion of 
food preparation. Since there is no similar component in any of the other 
examples, we can assume that this first use is distinct from the other two. 

Next, we might note that the second example has a different syntactic sub¬ 
categorization from the others since its first argument, which denotes the 
role played by the subject, is a prepositional phrase. As will be discussed 
in Section 16.3, such differing syntactic behaviors are often symptomatic of 
differing underlying senses. Finally, the third example is specific to the do¬ 
main of incarceration. This is clear since this example provides almost no 
specific information about prison, and yet has an obvious and clear meaning; 
a meaning which plays no role in the other examples. 

Another practical technique, for determining if two distinct senses are 
present is to combine two separate uses of a lexeme into a single example 
using a conjunction, a device has the rather improbable name of zeugma, zeugma 
Consider the following ATIS examples. 

(16.11) Which of those flights serve breakfast? 

(16.12) Does Midwest express serve Philadelphia? 
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(16.13) ?Does Midwest express serve breakfast and Philadelphia? 

The oddness of invented third example indicates there is no sensible way to 
make a single sense of serve work for both breakfast and Philadelphia. More 
precisely, the underlying concepts invoked by serve in the first example can 
not be applied in any meaningful way to Philadelphia. This is an instance 
where we can make use of examples from a corpus along with our native 
intuitions in a structured way to discover the presence or distinct senses. 

The issue of discovering the proper set of senses for a given lexeme is 
distinct from the process of determining which sense of a lexeme is being 
used in a given example. This latter task is called word sense disambigua- 
disamb1gua e tion, or word sense tagging by analogy to part-of-speech tagging, and is 
covered in detail in Chapter 17. As this analogy implies, the task typically 
presumes that a fixed set of senses can be associated with each lexical item, 
a dubious proposition that we will take up in Section 16.4. 

Finally, let us turn briefly to the topic of relatedness among the various 
senses of a single polysemous lexeme. Earlier, we made an appeal to the 
intuition that the polysemous senses of a lexeme are unlikely to have come 
about by coincidence. This raises the obvious question that if they are not 
related by coincidence, how are they related. This question has not received 
much attention from those constructing large lexicons since as long as the 
lexicon contains the correct senses, how they came to be there is largely 
irrelevant. However, as soon as applications begin to deal with a wide variety 
of inputs, they encounter novel uses that do not correspond to any of the 
static senses in the system’s lexicon. By examining the systematic relations 
among listed senses, we can gain insight into the meanings of such novel 
uses. These notions will be discussed in more detail in Section 16.4. 

Synonymy 

The phenomenon of synonymy is sufficiently widespread to account for the 
popularity of both thesauri and crossword puzzles. As with homonymy, the 
synonymy notion of synonymy, has a deceptively simple definition: different lexemes 
with the same meaning. Of course, this definition leaves open the question 
of what it means for two lexemes to mean the same thing. Although Sec¬ 
tion 16.3 will provide some answers to this question, we can make progress 
fuTABiuTY without answering it directly by invoking the notion of substitutability: two 
lexemes will be considered synonyms if they can substituted for one another 
in a sentence without changing either the meaning or the acceptability of 
the sentence. The following ATIS examples illustrate this notion of substi- 
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tutability. 

(16.14) How big is that plane? 

(16.15) Would I be flying on a large or small plane? 

Exchanging big and large in these examples has no noticeable effect 
on either the meaning or acceptability of these sentences. We can take this 
as evidence for the synonymy of big and large , at least for these examples. 
Note that this is intended to be a very narrow statement. In particular, we are 
not saying anything about the relative likelihood of occurrence of big and 
large in contexts similar to these. 

Not surprisingly, if we take the notion of substitutability to mean sub¬ 
stitutable in all possible environments, then true synonyms in English are few 
and far between, as it is almost always possible to find some sentence where 
a purported synonym fails to substitute successfully. Given this, we will fall 
back on a weaker notion that allows us to call two lexemes synonyms if they 
are substitutable in some environment. This is, for all practical purposes, the 
notion of synonymy used in most dictionaries and thesauri. 

The success or failure of the substitution of a given pair of candidate 
synonyms in a given setting depends primarily on four influences: polysemy, 
subtle shades of meaning, collocational constraints, and register. As we will 
see, only the first two involve the notion of meaning. 

To explore the effect of polysemy on substitutability, consider the fol¬ 
lowing WSJ example where a substitution of large for big clearly fails. 

(16.16) Miss Nelson, for instance, became a kind of big sister to Mrs. Van 
Tassel’s son, Benjamin. 

(16.17) ?Miss Nelson, for instance, became a kind of large sister to Mrs. 
Van Tassel’s son, Benjamin. 

The source of this failure is the fact that the lexeme big has as one of its 
distinct polysemous senses the notion of being older, or grown up. Since 
the lexeme large lacks this sense among its many meanings, it is not sub¬ 
stitutable for big in those environments where this sense is required. In this 
instance, the result is a sentence with a different meaning altogether. In other 
cases, such a substitution may result in a sentence that is either odd or en¬ 
tirely uninterpretable. 

We referred to the next influence on synonymy as shades of mean¬ 
ing. By this, we have in mind cases where two lexemes share a central core 
meaning, but where additional ancillary facts are associated with one the 
lexemes. Consider the use of the lexemes price and fare in the ATIS corpus. 
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Semantically, both have the notion of the cost for a service at the core of 
their meanings. They are not, however, freely interchangeable. Consider the 
following ATIS examples. 

(16.18) What is the cheapest first class fare? 

(16.19) ?What is the cheapest first class price? 

Exchanging price for fare in this example leads to a certain amount of 
oddity. The source of this oddness is hard to pin down, but fare seems to be 
better suited to the costs for various services (ie. coach, business and first 
class fares), while price seems better applied to the tickets that represent 
these services. Of course, a more complete account of how these lexemes 
are used in this domain would require a systematic analysis of a corpus of 
examples. The point is that although these terms share a core meaning, there 
are subtle meaning-related differences that influence how they can be used. 

These two influences on substitutability clearly involve the meanings 
of the lexical items. There are, however, other influences on the success 
or failure of a synonym substitution that are not based on meaning in any 
direct way. Collocational constraints are one such influence. By a colloca¬ 
tional constraint, we mean the kind of arbitrary associations, or attractions, 
between lexical items that were captured using techniques such as N-grams 
in Chapter 6. 

Consider the following WSJ example. 

(16.20) We frustrate ’em and frustrate ’em, and pretty soon they make a big 
mistake. 

(16.21) ?We frustrate ’em and frustrate ’em, and pretty soon they make a 
large mistake. 

As this example illustrates, there is a preference for using big rather than 
large when referring to mistakes of a critical or important nature . This is 
not due to a polysemy difference, nor does it seem to be due to any subtle 
shaded meaning difference between big and large. Note also, that this is 
clearly different than the large sister example in that a large mistake is still 
interpretable in the correct way; it just does not seem as natural to use large 
as big. Therefore, in this case, we must say that there is simply an arbitrary 
preference for big as opposed to large as applied to mistakes. 
register Finally, by register, we mean the social factors that surround the use of 

possible synonyms. Here we are referring to lexemes with essentially identi¬ 
cal meanings that are not interchangeable in all environments due to factors 
such as politeness, group status, and other similar social pressures. For ex- 
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ample, multisyllabic lexemes with Latin or Greek origins are often used in 
place of shorter lexemes when a technical or academic style is desired. 

As was the case with homonymy, these influences on synonymy have 
differing practical implications for computational applications. In Chapters 
19 and 20, we will see that similarity of meaning, collocational constraints, 
and appropriateness of use are of great importance in natural language gen¬ 
eration and machine translation. On the other hand, in the domains of infor¬ 
mation extraction and information retrieval, appropriateness of use is of far 
less consequence than the notion of identity of meaning. 

Hyponymy 

In our discussion of price and fare , we introduced the notion of pairs of 
lexemes with similar but non-identical meanings. The notion of hyponymy 
is based on a restricted class of such pairings: pairings where one lexeme 
denotes a subclass of the other. For example, the relationship between car 
and vehicle is one of hyponymy. Since this relation is not symmetric we will 
refer to the more specific lexeme as a hyponym of the more general one, 
and conversely to the more general term as a hypernym of the more specific 
one. We would therefore say that car is a hyponym of vehicle , and vehicle is 
hypernym of car. 

As with synonymy, we can explore the notion of hyponymy by making 
use of a restricted kind of substitution. Consider the following schema. 

That is a x. => That is a y. 

If x is a hyponym of y, then in any situation where the sentence on the left 
is true, the newly created sentence on the right must also be true, as in the 
following example. 

That is a car. => That is a vehicle. 

There a number of important differences between this kind of lim¬ 
ited substitution and the kind of substitutions discussed with respect to syn¬ 
onymy. There the resulting sentence could plausibly serve as a substitute for 
the original sentence. Here, the new sentence is not intended to be a sub¬ 
stitution for the original, rather it is merely serves as diagnostic test for the 
presence of hyponomy. 

The concept of hyponymy is closely related to a number of other no¬ 
tions that play central roles in biology, linguistic anthropology and computer 
science. 

The term ontology usually refers to an analysis of some domain, or mi¬ 
croworld, into a set of distinct objects. A taxonomy is a particular arrange- 
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ment of the elements of an ontology into a tree-like class inclusion structure. 
Normally, there are a set of well-formedness constraints on taxonomies that 
go beyond their component class inclusion relations. For example, the lex¬ 
emes hound , mutt, and puppy are all hyponyms of dog , but it would be odd 
to construct a taxonomy from those pairs since the concepts motivating the 
relations is different in each case. Finally, the computer science notion of 
an object hierarchy is based the notion that objects from an ontology ar¬ 
ranged in a taxonomy, can receive, or inherit, features from their ancestors 
in a taxonomy. This, of course, only makes sense when the elements in the 
taxonomy are in fact complex structured objects with features to be inherited. 

Therefore, sets of hyponymy relations, by themselves, do not consti¬ 
tute an ontology, category structure, taxonymy, or object hierarchy. They 
have, however, proved to be useful as approximations to such structures. We 
will return to the topic of hyponymy in Section 16.2 when we discuss the 
WordNet database. 


WordNet: A Database of Lexical Relations 

The widespread use of lexical relations in linguistic, psycholinguistic, and 
computational research has led to a number of efforts to create large elec¬ 
tronic databases of such relations. These efforts have, in general, followed 
one of two basic approaches: mining information from existing dictionaries 
and thesauri, and handcrafting a database from scratch. Despite the obvious 
advantages of reusing existing resources, WordNet, the most well-developed 
and widely used lexical database for English, was developed using the latter 
approach (Beckwith etal , 1991). 

WordNet consists of three separate databases, one each for nouns and 
verbs, and a third for adjectives and adverbs; closed class lexical items are 
not included in WordNet. Each of the three databases consists of a set of 
lexical entries corresponding to unique orthographic forms, accompanied by 
sets of senses associated with each form. Figure 16.1 gives some idea of the 
scope of the current, WordNet 1.6, release. The databases can be accessed 
directly with a browser (locally or over the Internet), or programmatically 
through the use of a set of C library functions. 

In their most complete form, WordNet’s sense entries consist of a set 
of synonyms, a dictionary-style definition, or gloss, and some example uses. 
Figure 16.2 shows an abbreviated version of the wordnet entry for the noun 
bass. As this entry illustrates, there are several important differences be- 
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tween WordNet entries and our notion of a lexeme. First, since WordNet 
contains no phonological information, it makes no attempt to keep sepa¬ 
rate lexemes with distinct pronunciations. For example, in this entry bass 4 , 
bass 5 , and bass 8 all refer to the [b ae s] fish sense, while the others refer 
to the [b ey s] musical sense. More generally, WordNet makes no attempt 
to distinguish homonymy from polysemy. For example, as far as this en¬ 
try is concerned, bass 1 bears the same relationship to bass 2 as it does to 
bass 4 . This is a conservative strategy that reflects the fact that although 
there are fairly reliable diagnostics for discriminating among distinct word 
senses, systematically organizing the resulting polysemous senses is a much 
more uncertain and subjective activity. Given this, the developers of Word- 
Net have opted to simply list distinct senses, without attempting to explicitly 
organize them in the hierarchical manner seen in many dictionaries. 

Figures 16.3 and 16.4 give a rough idea of how these senses are dis¬ 
tributed throughout the database. The distributions are extremely skewed, 
with a small number of entries having a large number of senses, and a large 


Category 

Unique Forms 

Number of Senses 

Noun 

94474 

116317 

Verb 

10319 

22066 

Adjective 

20170 

29881 

Adverb 

4546 

5677 

Figure 16.1 Scope of the current WordNet 1.6 release in terms of unique 
entries and total number of senses for the four databases. 


The noun “bass” has 8 senses in WordNet. 

1. bass - (the lowest part of the musical range) 

2. bass, bass part - (the lowest part in polyphonic music) 

3. bass, basso - (an adult male singer with the lowest voice) 

4. sea bass, bass - (flesh of lean-fleshed saltwater fish of the family Serranidae) 

5. freshwater bass, bass - (any of various North American lean-fleshed freshwater 

fishes especially of the genus Micropterus) 

6. bass, bass voice, basso - (the lowest adult male singing voice) 

7. bass - (the member with the lowest range of a family of musical instruments) 

8. bass - (nontechnical name for any of numerous edible marine and 

freshwater spiny-finned fishes) 


Figure 16.2 The WordNet 1.6 entry for the noun bass. 
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number having a single sense. Distributions like this are ubiquitous when 
dealing with the lexicon, and are referred to as Zipf distributions (Zipf, 
1949). Note also that the degree of polysemy in the verb database is higher 
than in the noun database. This is consistent with the fact that there are far 
fewer verbs than nouns in English and their meanings are far more malleable. 
Finally, we should note that these polysemy distributions correlate well with 
actual word frequency and led the WordNet developers to use degree of pol¬ 
ysemy as a proxy for frequency in the database. 



Polysemy Rank 


Figure 16.3 Distribution of senses among the nouns in WordNet. 


Of course, a simple listing of lexical entries would not be much more 
useful than an ordinary dictionary. The power of WordNet lies in its set 
of domain-independent lexical relations. These relations can hold among 
WordNet entries, senses, or sets of synonyms. They are, for the most part, 
restricted to items with the same part-of-speech, or more pragmatically, to 
items within the same database. Figures 16.5, 16.6, and 16.7 show a subset 
of the relations associated with each of the three databases, along with a 
brief explanation and an example. Since a full discussion of the contents 
of WordNet is beyond the scope of this text, we will limit ourselves to a 
discussion of two of its most useful and well-developed features: its sets of 
synonyms, and its hyponymy relations. 

The fundamental basis for synonymy in WordNet is the same as that 
given on page 596. Two WordNet entries are considered synonyms if they 
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Figure 16.4 Distribution of senses among the verbs in WordNet. 


Relation 

Definition 

Example 

Hyperym 

Hyponym 

Has-Member 

Member-Of 

Has-Stuff 

Stuff-Of 

Has-Part 

Part-Of 

Antonym 

From concepts to superordinates 

From concepts to subtypes 

From groups to their members. 

From members to their groups. 

From things to what they’re made of. 
From stuff to what it makes up. 

From wholes to parts 

From parts to wholes. 

Opposites 

breakfast —> meal 
meal lunch 
faculty —> professor 
copilot —> crew 

table —> leg 
course —> meal 
leader —> follower 

Figure 16.5 Noun Relations in WordNet. 


Relation 

Definition 

Example 

Hypemym 

Troponym 

Entails 

Antonym 

From events to superordinate events 
From events to their subtypes 

From events to the events they entail 
Opposites 

fly travel 
walk stroll 
snore —> sleep 
increase decrease 

Figure 16.6 Verb Relations in WordNet. 
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Relation 

Definition 

Example 

Antonym 

Opposite 

heavy light 

Adverb 

Opposite 

quickly <=> slowly 

Figure 16.7 

Adjective and Adverb Relations in WordNet. 


can be successfully substituted in some context. The particular theory and 
implementation of synonymy in WordNet is organized around the notion of 
synset a synset, a set of synonyms. Consider the following example of a synset. 

{chump, fish, fool, gull, mark, patsy, fall guy, 
sucker, schlemiel, shlemiel, soft touch, mug} 

The dictionary-like definition, or gloss, of this synset describes it as a per¬ 
son who is gullible and easy to take advantage of. Each of the lexical entries 
included in the synset can, therefore, be used to express this notion in some 
setting. In practice, synsets like this one actually constitute the senses asso¬ 
ciated with many WordNet entries. Specifically, it is this exact synset, with 
its associated definition and examples, that makes up one of the senses for 
each of the entries listed in the synset. 

Looking at this from a more theoretical perspective, each synset can 
be taken to represent a concept that has become lexicalized in the language. 
Synsets are thus somewhat analogous to the kinds of concepts we discussed 
in Chapter 14. Instead of representing concepts using logical terms, Word- 
Net represents them as lists comprised of the lexical entries that can be used 
to express the concept. This perspective motivates the fact that it is synsets, 
not lexical entries or individual senses, that participate in most of the seman¬ 
tic relations shown in Figures 16.5, 16.6, and 16.7. 

The hyponymy relations in WordNet correspond directly to the notion 
of immediate hyponymy discussed on page 599. Each synset is related to 
its immediately more general and more specific synsets via direct hypemym 
and hyponym relations. To find chains of more general or more specific 
synsets, one can simply follow a transitive chain of hypemym and hyponym 
relations. To make this concrete, consider the hypemym chains for bass 3 
and bass 7 shown in Figure 16.8. 

In this depiction of hyponymy, successively more general synsets are 
shown on successive indented lines. The first chain starts from the concept 
of a human bass singer. It’s immediate superordinate is a synset correspond¬ 
ing to the generic notion of a singer. Following this chain leads eventually 
to notions such as entertainer and person. The second chain, which starts 
from the musical instrument notion, has a completely different chain leading 
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Sense 3 

bass, basso — 

(an adult male singer with the lowest voice) 

=> singer, vocalist 

=> musician, instrumentalist, player 
=> performer, performing artist 
=> entertainer 

=> person, individual, someone... 

=> life form, organism, being... 

=> entity, something 
=> causal agent, cause, causal agency 
=> entity, something 

Sense 7 
bass — 

(the member with the lowest range of a family of 
musical instruments) 

=> musical instrument 
=> instrument 
=> device 

=> instrumentality, instrumentation 
=> artifact, artefact 

=> object, physical object 
=> entity, something 


Figure 16.8 Hyponymy chains for two separate senses of the lexeme bass. 
Note that the chains are completely distinct, only converging at entity. 


eventually such concepts as musical instrument, device and physical object. 
Both paths do eventually join at the synset entity which basically serves as a 
placeholder at the top of the hierarchy. 


16.3 The Internal Structure of Words 

The approach to meaning spelled out in the last two chapters hinged on the 
notion that there is a fundamental predicate-argument structure underlying 
our meaning representations. In composing such representations, we as¬ 
sumed that certain classes of lexemes tend to contribute the predicate and 
predicate-argument structure, while others contribute the arguments. This 
section explores in more detail the systematic ways that the meanings of lex- 
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DEEP ROLES 


THEMATIC 

ROLE 


ernes are structured to support this notion. In particular, it explores the notion 
that the meaning representations associated with lexemes have analyzable in¬ 
ternal structures, and that it is these structures, combined with a grammar, 
that determine the relations among lexemes in well-formed sentences. 

Thematic Roles 

Thematic roles, first proposed by Gruber (1965a) and Fillmore (1968) 1 are 
a set of categories which provide a shallow semantic language for charac¬ 
terizing certain arguments of verbs. For example consider the following two 
WSJ fragments: 

(16.22) Houston’s Billy Hatcher broke a bat. 

(16.23) He opened a drawer. 

In the predicate calculus event representation of Chapter 14, part of the 
representation of these two sentences would be the following: 

=le,x,y I sa(e, Breaking) A Breaker(e, Billy Hatcher) 

ABrokenThing(e , y) A Isa(y , BaseballBat) 

=le,x,y Isa(e, Opening) A Opener(e , he) 

AOpenedThing(e,y) AIsa(y,Door) 

In this representation, the roles of the subjects of the verbs break and 
open are Breaker and Opener respectively. These deep roles are specific 
to each possible kind of event; Breaking events have Breakers , Opening 
events have Openers , Eating events have Eaters , and so on. But Breakers 
and Openers have something in common. The are both volitional actors, of¬ 
ten animate, and they have direct causal responsibility for their events. A 
thematic role is a way of expressing this commonality. We say that the 
subjects of both these verbs are agents. Thus agent is the thematic role 
which represents an abstract idea such as volitional causation. Similar, the 
direct objects of both these verbs, the BrokenThing and OpenedThing , are 
both pro to typically inanimate objects which are affected in some way by the 
action. The thematic role for these participants is the theme. 

As we will discuss below, while there is no standard set of thematic 
roles, there are many roles that are commonly used by computational sys¬ 
tems. For example, in any straightforward interpretation of Example 16.24, 
Mr. Cockwell has had his collarbone broken, but there is no implication 
that he was the agent of this unfortunate event. This kind of participant 


1 Fillmore actually called them deep cases , on the metaphor of morphological case. 
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can be labeled an experienced while the directly effected participant, the 
collarbone in this case, is again assigned the theme role. 

(16.24) A company soccer game last year got so rough that Mr. Cockwell 
broke his collarbone and an associate broke an ankle. 

In Example 16.25, the earthquake is the direct cause of the glass break¬ 
ing and hence might seem to be a candidate for an agent role. This seems 
odd, however, since earthquakes are not the kind of participant that can inten¬ 
tionally do anything. Examples such as this have been the source of consid¬ 
erable debate over the years among the proponents of various thematic role 
theories. Two approaches are common: assign the earthquake to the AGENT 
role and assume that the intended meaning has some kind of metaphorical 
connection to the core animate/volitional meaning of agent, or add a role 
called FORCE that is similar to AGENT but lacks any notion of volitionality. 
We will follow this latter approach and return to the notion of metaphor in 
Section 16.4. 

(16.25) The quake broke glass in several downtown skyscrapers. 

Finally, in Example 16.26, the subject (it) refers to an event participant 
(in this case, someone else’s elbow) whose role in the breaking event is as 
the instrument of some other agent or force. Such participants are called 
INSTRUMENTS. 

(16.26) It broke his jaw. 

Figure 16.9 presents a small list of commonly-used thematic roles 
along with a rough description of the meaning of each. Figure 16.10 pro¬ 
vides representative examples of each of role. Note that this list of roles 
is by no means definitive, and does not correspond to any single theory of 
thematic roles. 

Applications to Linking Theory and Shallow Semantic Interpretations 

One common use thematic roles in computational systems is as a shallow 
semantic language. For example, as Chapter 21 will describe, thematic roles 
are sometimes used in machine translation systems as part of a useful inter¬ 
mediate language. 

Another use of thematic roles, which was part of their original moti¬ 
vation in Fillmore (1968), was as an intermediary between semantic roles in 
conceptual structure or common-sense knowledge like Breaker and Driven- 
Thing and their more language-specific surface grammatical realization as 
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Thematic Role 

Definition 

AGENT 

EXPERIENCER 

FORCE 

THEME 

RESULT 

INSTRUMENT 

BENEFICIARY 

SOURCE 

GOAL 

The volitional causer of an event 

The experiencer of an event 

The non-volitional causer of the event 

The participant most directly affected by an event 

The end product of an event 

An instrument used in an event 

The beneficiary of an event 

The origin of the object of a transfer event 

The destination of an object of a transfer event 

Figure 16.9 Some commonly-used thematic roles with their definitions. 


Thematic Role 

Example 

AGENT 

The waiter spilled the soup 

EXPERIENCER 

John has a headache 

FORCE 

The wind blows debris from the mall into our yards 

THEME 

Only after Benjamin Franklin broke the ice ... 

RESULT 

The French government has built a regulation-size base¬ 
ball diamond... 

INSTRUMENT 

He turned to poaching catfish, stunning them with a shock¬ 
ing device 

BENEFICIARY 

Whenever Ann Callahan makes hotel reservations for her 
boss... 

SOURCE 

I flew in from Boston. 

GOAL 

I drove to Portland. 

Figure 16.10 

Prototypical examples of various thematic roles. 


subject and object. Fillmore noted that there are prototypical patterns gov¬ 
erning which argument of a verb will become the subject of an active sen¬ 
tence, proposing the following hierarchy (often now called a thematic hier- 
~hy archy (Jackendoff, 1972)) for assigning the subject role: 

AGENT >- INSTRUMENT >- THEME 

Thus if the thematic description of a verb includes an agent, an in¬ 
strument, and a theme, it is the agent which will be realized as the 
subject. If the thematic description only includes an instrument and a 
THEME, it is the INSTRUMENT which will become the subject. The thematic 
hierarchy is used in reverse for determining the direct object of active sen¬ 
tences, or the subject of passive sentences. Here are examples from Fillmore 
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(1968) using the verb open : 

(16.21) John opened the door. 

Agent Theme 

(16.28) John opened the door with the key. 

Agent Theme Instrument 

(16.29) The key opened the door. 

Agent Theme 

(16.30) The door was opened by John. 

Theme Agent 

This approach led to a wide variety of work over the last thirty years 
on the mapping between conceptual structure and grammatical function, in 
an area generally referred to as linking theory. For example many scholars theory 
such as Talmy (1985), Jackendoff (1983b), and Levin (1993) show that se¬ 
mantic properties of verbs help predict which surface alternations they can fioNs NA ' 
take. An alternation is a set of different mappings of conceptual (deep) roles 
to grammatical function. For example Fillmore (1965) and very many subse¬ 
quent researchers have studied the dative alternation, the phenomenon that alternation 
certain verbs like give, send , or read which can take an Agent, a Theme, 
and a Goal, allow the Theme to appear as object and the Goal in a prepo¬ 
sitional phrase (as in 16.31a), or the Goal to appear as the object, and the 
Theme as a sort of ‘second object’ (as in 16.31b): 

(16.31) a .Doris gave/sent/read the book to Cary. 

Agent Theme Goal 

b. Doris gave/sent/read Cary the book. 

Agent Goal Theme 

Many scholars, including Green (1974), Pinker (1989), Gropen et al. 

(1989), Goldberg (1995) and Levin (1993) (see Levin (1993, p. 45) for a full 
bibliography), have argued this alternation occurs with particular semantic 
classes of verbs, including (from Levin) ‘verbs of future having’ (advance, 
allocate, offer, owe), ‘send verbs’ (forward, hand, mail), ‘verbs of throwing’ 

(kick, pass, throw, and many other classes. 

Similarly, Talmy (1985), following Lakoff (1965, p.126), shows that 
‘affect’ verbs such as frighten, please, and exasperate can appear with the 
Theme as subject, as in (16.32), or with the Experiencer as subject and 
the Theme as a prepositional object, as in (16.33): 

(16.32) a. That frightens me. 

Theme Experiencer 

b. That interests me. 

Theme Experiencer 
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c. That surprises me. 

Theme Agent 

(16.33) a. I am frightened of that. 

Experiencer Theme 

b. I am interested in that. 

Experiencer Theme 

c. I am surprised at that. 

Experiencer Theme 

Levin (1993) summarizes 80 of these alternations, including extensive 
lists of the verbs in each semantic class, together with the semantic con¬ 
straints, exceptions, and other idiosyncracies. This list has been used in a 
number of computational models (e.g. Dang et a/., 1998; Jing and McKe- 
own, 1998) 

While research of the type summarized above has shown a relation 
between verbal semantic and syntactic realization, it is less clear that this re¬ 
lation is mediated by a small set of thematic roles, with or without a thematic 
hierarchy. For example, it turns out that semantic classes are insufficient to 
define the set of verbs that participate in an alternation. For example many 
verbs do not allow the dative alternation despite being in the proper semantic 
class (e.g. donate , return , transfer). In addition, as shown above, many of 
the verbal alternations violate any standard thematic hierarchy (dative alter¬ 
nation sentences like Ling sent Mary the book have a Goal as direct object 
followed by an oblique Theme, when Theme should be the best direct ob¬ 
ject). Furthermore, arguments about the appropriate set of thematic roles 
are legion. But an even greater problem is that thematic roles, however they 
are defined, could only play a very small role in the general mapping from 
semantics to syntax. This is because thematic roles are only relevant to de¬ 
termining the grammatical role of NP and PP arguments, and play no part 
in the realization of other arguments of verbs and other predicates. Many 
such possible arguments were described in Figure 11.3 on page 411, such as 
sentential complements (Sfin, Swh-, Sforto), verb phrases (YPbrst, YPto, 
etc), or quotations (Quo). Furthermore, thematic roles only are useful in 
mapping the arguments of verbs; but nouns, for example, have arguments as 
well (destruction of the city, father of the bride) . 

There are a number of possible responses to these problems with the¬ 
matic roles. Many systems continue to use them for such practical purposes 
as interlinguas in machine translation or as a convenient level of shallow 
semantic interpretation. Other researchers have argued that thematic roles 
should be considered an epiphenomenon, rather than a distinct represen- 
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tational level. For example following Foley and van Valin (1984), Dowty 
(1991) argues that rather than a discrete set of thematic roles there are only 
two cluster-concepts, Proto-Agent and Proto-Patient. Determining 
whether an argument of a verb is a Proto-Agent is predictable from the 
entailments of the deep conceptual structure meaning of the verb. The map¬ 
ping from semantic role in conceptual structure to grammatical function pro¬ 
ceeds via simple rules (the most Proto-AgENT- like of the arguments is the 
subject, the most Proto-PAT iENT-like is the object (or the subject of the 
passive construction)). Dowty’s two rules make direct reference to the deep 
conceptual structure of the verb; thus thematic roles do not appear at any 
representational level at all. 

One problem with Dowty’s model is that the choice of thematic roles 
is not always predictable from the underlying conceptual structure of the 
event and its participants. For example Fillmore (1977) pointed out that 
the different verbs which can describe a commercial event each choose a 
different way to map the participants of the event. For example, a transaction 
between Amie and Benson involving three dollars and a sandwich can be 
described in any of these ways: 

(16.34) a. Amie bought the sandwich from Benson for three dollars. 

b. Benson sold Amie the sandwich for three dollars. 

c. Amie paid Benson three dollars for the sandwich. 

Each of these verbs buy , sell , and pay , chooses a different perspective 
on the commercial event, and realizes this perspective by choosing a different 
mapping of underlying participants to thematic roles. The fact that these 
three verbs have very different mappings suggests that the thematic roles for 
a verb must be listed in the lexical entry for the verb, and are not predictable 
from the underlying conceptual structure. 

This fact, together with the fact mentioned earlier that verb alternations 
are not completely predictable semantically (e.g. exceptions like donate ) has 
led many researchers to assume that any useful computational lexicon needs 
to list for each verb (or adjective or other predicate) its syntactic and the¬ 
matic combinatory possibilities. Another advantage of listing the combina¬ 
tory possibilities for each verb is that the probability of each thematic frame 
can also be listed. 

One recent attempt to list these elements for a number of predicates of 
English is the framenet project (Baker et al , 1998; Lowe et ah, 1997). A 
framenet entry for a word lists every set of arguments it can take, includ¬ 
ing the possible sets of thematic roles, syntactic phrases, and their grammat- 
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ical function. The thematic roles used in framenet are much more specific 
than the 9 examples we’ve been describing. Each framenet thematic role 
is defined as part of a frame, and each frame as part of a domain. For exam¬ 
ple the Cognition domain has frames like static cognition ( believe, think, 
understand, etc), cogitation {brood, ruminate ), judgment, {accuse, admire, 
rebuke ), etc. All of the cognition frames define the thematic role cognizer. 
In the judgment frame, the cognizer is referred to as the judge; the frame 
also includes an Evaluee, a Reason, and a Role; here are some examples 
from (Johnson, 1998): 

Judge Kim respects Pat for being so brave 
Evaluee Kim respects Pat for being so brave 
Reason Kim respects Pat for being so brave 
Role Kim respects Pat as a scholar 


Each entry is also labeled by one of the phrase types described in 
Figure 11.3 on page 411, and by a grammatical function (subject, object, 
or complement). For example, here is part of the framenet entry for the 
judgment verb appreciate ; we have shown only the active senses of the verb; 
the full entry includes passives as well. Example sentences are (sometimes 
shortened) from the British National Corpus: 

(16.35) a. Judge Reason Evaluee 

NP/Subj NP/Obj PP(in)/Comp 

I still appreciate good manners in men. 

b. Judge Evaluee Reason 

NP/Subj NP/Obj PP(for)/Comp 

I could appreciate it for the music alone. 

c. Judge Reason 


NP/Subj NP/Obj 

I appreciate your kindness 

d. Judge Evaluee Role 

NP/Subj NP/Obj PP(for)/Comp 

He did not appreciate the artist as a dissenting voice. 


By contrast, another sense of the verb appreciate is as a verb of static 
cognition like understand; verbs of static cognition have roles like Cog¬ 
nizer and Content; here are some examples: 

(16.36) a. Cognizer Content 

NP/Subj Sfin/Comp 

They appreciate that communication is a two-way process, 

b. Cognizer Content 

NP/Subj Swh-/Comp 

She appreciated how far she had fallen from grace. 
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It should be clear from examining the example sentences that some 
generalizations can be drawn about the realization of different thematic roles. 

Judges, Cognizers, and Agents in general are often realized as subjects 
of active sentences. Roles are often realized as PPs with the preposition 
as. Content is often realized as some kind of S. Representing thematic 
roles at this fine-grained level may thus make the mapping to syntax more 
transparent. The problem with a scheme like Framenet is the extensive 
human effort it requires in defining thematic roles for each domain and each 
frame. 

Selection Restrictions 

The notion of a selection restriction can be used to augment thematic roles rIstrIction 
by allowing lexemes to place certain semantic restrictions on the lexemes and 
phrases that can accompany them in a sentence. More specifically, a selec¬ 
tion restriction is a semantic constraint imposed by a lexeme on the concepts 
that can fill the various argument roles associated with it. As with many 
other kinds of linguistic constraints, selection restrictions can most easily 
be observed in situations where they are violated. Consider the following 
example originally discussed in Chapter 14. 

(16.37) I wanna eat someplace that’s close to icsi. 

There are two possible parses for this sentence corresponding to the intransi¬ 
tive and transitive versions of the verb eat. These two parses lead, in turn, to 
two distinct semantic analyses. In the intransitive case, the phrase someplace 
that's close to icsi is an adjunct that modifies the event specified by the verb 
phrase, while in the transitive case it provides a true argument to the eating 
event. This latter case is similar in structure and interpretation to examples 
such as the following, where the noun phrase specifies the thing to be eaten. 

(16.38) I wanna eat some really cheap Chinese food right now. 

Not surprisingly, attempting to analyze Example 16.37 along these 
lines results in a kind of semantic ill-formedness. This ill-formedness signals 
the presence of a selection restriction imposed by eat on its patient role: it 
has to be something that is edible. Since the phrase being proposed as the 
patient in this scenario can not easily be interpreted as edible, the inter¬ 
pretation exhibits the semantic analog of syntactic ungrammaticality. This 
particular variety of ill-formedness arises from what is known as a selec¬ 
tion restriction violation: a situation where the semantics of the filler of a restrIction 

VIOLATION 



614 


Chapter 16. Lexical Semantics 


thematic role is not consistent with a constraint imposed on the role by the 
predicate. 

This rather informal description of selection restrictions needs to be 
refined in a number of ways before it can be put to practical use. The first 
refinement concerns the proper locus for stating the selection restrictions. 
As discussed Sectionl6.1, lexemes are often associated with a wide variety 
of different senses and, not surprisingly, these senses can enforce differing 
constraints on their arguments. Selection restrictions therefore are associated 
with particular senses, not entire lexemes. Consider the following examples 
of the lexeme serve. 

(16.39) Well, there was the time they served green-lipped mussels from 
New Zealand. 

(16.40) Which airlines serve Denver? 

(16.41) Which ones serve breakfast? 

Example 16.39 illustrates the cooking sense of serve , which ordinarily re¬ 
stricts its patient to be some kind foodstuff. Example 16.40 illustrates the 
provides a commercial service to sense of serve, which constrains its pa¬ 
tient to be some type of identifiable geographic or political entity. The 
sense shown in the third example is closely related to the first, and illustrates 
a sense of serve that is restricted to specifications of particular meals. These 
differing restrictions on the same thematic role of a polysemous lexeme can 
be accommodated by associating them with distinct senses of the same lex¬ 
eme. As we will discuss in Chapter 17, this strongly suggests that selection 
restrictions can be used to discriminate these senses in context. 

Note that the selection restrictions imposed by different lexemes, and 
different senses of the same lexeme, may occur at widely varying levels 
of specificity, with some lexemes expressing very general conceptual cat¬ 
egories, and others expressing very specific ones indeed. Consider the fol¬ 
lowing examples of the verbs imagine , lift and diagonalize. 

(16.42) In rehearsal, I often ask the musicians to imagine a tennis game. 

(16.43) Others tell of jumping over beds and couches they can’t imagine 
clearing while awake. 

(16.44) I cannot even imagine what this lady does all day. 

(16.45) Atlantis lifted Galileo from the launch pad at 12:54 p.m. EDT and 
released the craft from its cargo bay about six hours later. 
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(16.46) When the battle was over, Mr. Kruger lifted the fish from the water, 
gently removed the hook from its jaw, admired it, and eased it back 
into the lake. 

(16.47) To diagonalize a matrix, is to find its eigenvalues. 

Given the meaning of imagine , it is not surprising to find that it places few 
semantic restrictions on the concepts that can fill its patient role. Its agent 
role, on the other hand, is restricted to humans and other animate entities. 
In contrast, the sense of lift shown in Examples 16.45 and 16.46 limits its 
patient to be something liftable, which as these examples illustrate is a 
notion that must cover both spacecraft and fish. For all practical purposes, 
this notion is best captured by the fairly general notion such as physical 
object. Finally, we have diagonalize which imposes a very specific constraint 
on the filler of its patient role: it has to be a matrix. 

These examples serve to illustrate an important fact about selection re¬ 
strictions: the concepts, categories, and features that are deployed by the 
lexicon as selection restrictions are not a part of the finite language capac¬ 
ity. Rather, they are as open-ended as the lexicon itself. This distinguishes 
selection restrictions from some of the other finite features of language that 
are used to define lexemes including parts-of-speech, thematic roles, and se¬ 
mantic primitives. 

Before we move on, it is worth pointing out that verbs are not the 
only part-of-speech that can impose selection restrictions on their arguments. 
Rather, it appears to be the case that any predicate-bearing lexeme can im¬ 
pose arbitrary semantic constraints on the concepts that fill its argument 
roles. Consider the following examples, which illustrate the selection re¬ 
strictions associated with some non-verb parts-of-speech. 

(16.48) Radon is a naturally occurring odorless, tasteless gas that can’t be 
detected by human senses. 

(16.49) What is the lowest fare for United Airlines flight four thirty? 

(16.50) Are there any restaurants open after midnight? 

The adjectives odorless and tasteless in 16.48 are restricted to concepts that 
can possess an odor or a taste. Similarly, as we discussed earlier in Section 
16.1, the noun fare is restricted to various forms of public transportation. Fi¬ 
nally, arguments to the preposition after must directly or indirectly designate 
points in time. 



616 


Chapter 16. Lexical Semantics 


Representing Selection Restrictions 

The semantics of selection restrictions can be captured in a straightforward 
way by extending the event-oriented meaning representations employed in 
Chapter 14. Recall that the representation of an event consists of a single 
variable that stands for the event, a predicate that denotes the kind of event, 
and a series of variables and relations that designate the roles associated 
with the event. Ignoring the issue of the ^-structures, and using thematic 
roles rather than deep event roles, the semantic contribution of a verb like 
eat might look like the following. 

3e,x,y Eatingfe) AAgent(e,x) /\Patient(e,y) 

With this representation, all we know about y, the filler of the Patient role, 
is that it is associated with an Eating event via the Patient relation. To 
stipulate the selection restriction that y must be something edible, we simply 
add a new term to that effect, as in the following. 

3e,x,y Eating(e) AEater(e,x) APatient(e,y) AIsa(y,EdibleThing) 

When a phrase like ate a hamburger is encountered, a semantic analyzer can 
form the following kind of representation. 

3e,x,y Eating(e) AEater(e,x) APatient(e,y) A Isa(y, EdibleThing) 
AIsa{y , Hamburger) 

This representation is perfectly reasonable since the membership of y in 
the category Hamburger is consistent with its membership in the category 
EdibleThing , assuming a reasonable set of facts in the knowledge base. Cor¬ 
respondingly, the representation for a phrase such as ate a takeoff would be 
ill-formed because membership in an event-like category such as Takeoff 
would be inconsistent with membership in the category EdibleThing . 

While this approach adequately captures the semantics of selection re¬ 
strictions, there are two practical problems with its direct use. First, using 
the full power of First Order Logic to perform the simple task of enforcing 
selection restrictions is overkill. There are far simpler formalisms that can 
do the job with far less computational cost. The second problem is that it 
presupposes a large logical knowledge-base of facts about the concepts that 
make up selection restrictions. Unfortunately, although such common sense 
knowledge-bases are being developed, none are widely available and few 
have the kind of scope necessary to the task. 

A far more practical approach, at least for English, is to exploit the 
hyponymy relations present in the WordNet database. In this approach, se¬ 
lection restrictions on semantic roles are stated in terms of WordNet synsets, 
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Sense 1 

hamburger, beefburger — 

(a fried cake of minced beef served on a bun) 

=> sandwich 

=> snack food 
=> dish 

=> nutriment, nourishment, sustenance... 
=> food, nutrient 

=> substance, matter 

=> object, physical object 
=> entity, something 


Figure 16.11 Evidence from WordNet that hamburgers are edible. 


rather than logical concepts. A given meaning representation can be judged 
to be well-formed if the lexeme that fills a thematic role has as one of its 
hypemyms, the synset specified by the predicate for that thematic role. Con¬ 
sider how this approach would work with our ate a hamburger example. 
Among its 60,000 synsets, WordNet includes the following one, which is 
glossed as any substance that can be metabolized by an organism to give 
energy and build tissue. 

{food, nutrient} 

Given this synset, we can specify it as the selection restriction on the pa¬ 
tient role of the verb eat , thus limiting fillers of this role to lexemes in this 
synset and its hyponyms. Luckily, the chain of hypemyms for hamburger 
shown in Figure 16.3, reveals that that hamburgers are indeed food. 

Note that in this approach, the filler of a role does not have to match 
the restriction synset exactly. Rather, a selection restriction is satisfied if the 
filler has the restricting synset as one of its eventual hypemyms. Thus in the 
hamburger example, the selection restriction synset is found five hypemym 
levels up from hamburger. 

Of course, this approach also allows individual lexemes to satisfy re¬ 
strictions at varying levels of specificity. For example, consider what hap¬ 
pens when we apply this approach to the patient roles of the verbs imagine , 
lift and diagonalize , discussed earlier. Let us restrict imagined patient to 
the synset {entity, something}, lift’s patient to {object, physical object} 
and diagonalize to {matrix}. This arrangement correctly permits imagine a 
hamburger and lift a hamburger , while also correctly ruling out diagonalize 
a hamburger. 
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Note that this approach relies on the presence in WordNet of exactly 
those lexemes that specify exactly the concepts needed for all possible se¬ 
lection restrictions. Unfortunately, there is no particular reason to believe 
that the set of concepts used as selection restrictions in a language is exactly 
subsumed by the lexemes in the language. This situation is accommodated 
to some extent in WordNet through the use of collocations such as physical 
object and snack food. 

To address this problem more directly, there are a number of linguistically- 
oriented taxonomies that sit somewhere between common sense knowledge¬ 
bases such as CYC, and lexical databases such WordNet. The objects con¬ 
tained in these hybrid models do not have to correspond to individual lexical 
items, but rather to those concepts that are known to be grammatically and 
lexically relevant. In most cases, the upper portions of these taxonomies are 
taken to represent domain and language-independent notions, such as phys¬ 
ical objects, states, events and animacy. One of the most well-developed of 
these ontologies is the the penman Upper Model, discussed in more detail 
in Chapter 20. 

Primitive Decomposition 

The theories of meaning representation presented here, and in the last few 
chapters, have had a decidedly lexical flavor. The meaning representations 
for sentences have been composed of atomic symbols that appear to cor¬ 
respond very closely to individual lexemes. However, other than thematic 
roles, these lexical representations have had not much of an internal struc¬ 
ture. The notion of primitive decomposition, or componential analysis, is 
an attempt to supply such a structure. 

To explore these notions, consider the following examples motivated 
by the discussion in McCawley (1968). 

(16.51) Jim killed his philodendren. 

(16.52) Jim did something to cause his philodendren to become not alive. 

One can make an argument that these two sentences mean the same thing. 
However, this is not case of synonymy, since kill is not synonymous with any 
individual lexemes in 16.52. Instead, one can think of kill as being equivalent 
to the particular configuration of more fundamental elements found in the 
second sentence. 

Taking this to the next logical step, we can invoke the notion of canon¬ 
ical form and say that these two examples should have the same meaning 
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representation — the one underlying Example 16.52. Translating a simple 
predicate like kill into a more complex set of predicates can be viewed as 
breaking down, or decomposing, the meaning of words into combinations of 
simpler, more primitive, parts. In this example, the more primitive, possibly 
atomic, parts are the meaning representations associated with the lexemes 
cause , become not , and alive. 

While many such primitive sets of have been proposed, the approach 
known as Conceptual Dependency (cd) (Schank, 1972) has been the most 
widely used primitive-based representational system within natural language 
processing. In this approach, eleven primitive predicates are used to repre¬ 
sent all predicate-like language expressions. Figure 16.12 shows the eleven 
primitives with a brief explanation of their meaning. 

As an example of this approach, consider the following sentence along 
with its CD representation. 

(16.53) The waiter brought Mary the check. 

3x,y Atrans(x) A Actor(x, Waiter) A Object{x , Check) A To(x,Mary) 
A Ptrans(y) A Actor(y, Waiter) A Object(y , Check) A To(y,Mary) 
Here, the verb brought is translated into the two primitives atrans and 
ptrans to indicate the fact that the waiter both physically conveyed the 
check to Mary and passed control of it to here. Note that CD also associates 
a fixed set of thematic roles with each primitive to represent the various 
participants in the action. 

Note that, in general, the compositional approach need not be limited to 
the meanings of verbs. The same notion can be used to decompose nominals 
into more primitive notions. Consider the following decompositions of the 
lexemes kitten , puppy , and child into more primitive elements. 

3xlsa(x, Feline) A Isa(x, Youth) 

3xlsa(x , Canine) A 1sa(x. Youth) 

3xlsa(x , Human) A fsa(x, Youth) 

Here the primitives represent more primitive categories of objects, rather 
than actions. Using these primitives, the close relationship between these 
lexemes and the related terms cat , dog and person can then be captured with 
the following similar formulas. 

3xlsa(x, Feline) AIsa(x,Adult) 

3xlsa(x , Canine) A Isa(x, Adult) 

3xlsa(x, Human) A Isa(x, Adult) 

The primary applications of primitives in natural language processing 
have been in semantic analysis and in machine translation. In semantic anal- 
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SEMANTIC 

FIELD 


Primitive 

Definition 

Atrans 

The abstract transfer of possession or control from one en¬ 
tity to another. 

Ptrans 

The physical transfer of an object from one location to 
another 

Mtrans 

The transfer of mental concepts between entities or within 
an entity. 

Mbuild 

The creation of new information within an entity. 

Propel 

The application of physical force to move an object. 

Move 

The integral movement of a body part by an animal. 

Ingest 

The taking in of a substance by an animal. 

Expel 

The expulsion of something from an animal. 

Speak 

The action of producing a sound. 

Attend 

The action of focusing a sense organ. 

Figure 16.12 

A set of conceptual dependency primitives 


ysis, the principle use has been in organizing the inference process. Instead 
of having to encode thousands of idiosyncratic meaning postulates with par¬ 
ticular lexical items, inference rules can be associated with a small number 
of primitives. We should note the use of primitive decomposition in the rep¬ 
resentation on nominals has largely been supplanted by the use of inheritance 
hierarchies. As we will see in Chapter 21, the emphasis in machine trans¬ 
lation has been on the use of primitives as language independent meaning 
representations, or interlinguas. 

Semantic Fields 

The lexical relations described in Section 16.1 had a decidedly local char¬ 
acter, and made no use of the internal structure of the lexemes taking part 
in the relation. The notion of a semantic field is an attempt to capture a 
more integrated, or wholistic, relationship among entire sets of words from a 
single domain. Consider the following set of words extracted from the ATIS 
corpus. 

reservation, flight, travel buy , price, cost, fare, rates, meal, plane 

It is certainly possible to assert individual lexical relations between 
many of the lexemes in this list. The resulting set of relations does not, how¬ 
ever, add up to a complete account of how these lexemes are related. They 
are clearly all defined with respect to a coherent chunk of common sense 
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background information concerning air travel. Background knowledge of 
this kind has been studied under a variety of frameworks and is known var¬ 
iously as a frame (Fillmore, 1985), model (Johnson-Laird, 1983), or script 
(Schank and Albelson, 1977), and plays a central role in a number of com¬ 
putational frameworks, some of which will be discussed in Chapter 18. 

The FrameNet project (Baker et al , 1998) is a recent attempt to pro- framenet 
vide a robust resource for this kind of knowledge. In FrameNet, lexemes that 
refer to actions, events, thematic roles, and objects belonging to a particular 
domain are linked to concepts contained in frames that represent that partic¬ 
ular domain. As in most current ontology efforts, these frames are arranged 
in a hierarchy so that specific frames can inherit roles from more abstract 
frames. The current FrameNet effort is directed at the creation of several 
thousand frame-semantic lexical entries. The domains to be covered in¬ 
clude: HEALTH CARE, CHANCE, PERCEPTION, COMMUNICATION, TRANS¬ 
ACTION, TIME, SPACE, BODY, MOTION, LIFE STAGES, SOCIAL CONTEXT, 

and cognition. 

16.4 Creativity and the Lexicon 

The approach we have presented thus far views the lexicon as a static repos¬ 
itory from which meaning representations are retrieved as needed. A more 
realistic alternative view holds that the lexicon is closer to a generative de¬ 
vice than a static repository. Rather than simply retrieving static senses, the 
lexicon generates meaning components appropriate to each situation on de¬ 
mand. Under this view, much of the apparent polysemy in the lexicon is due 
to this generative capacity. This capacity is, of course, not unlimited or un¬ 
systematic. Rather, it is governed by a number of productive, or generative, 
models that can systematically combine lexical, grammatical, contextual, 
and common sense knowledge to create the novel meanings we see every 
day. 

To make this discussion more concrete, consider the following sen¬ 
tence from the WSJ corpus. 

(16.54) That doesn’t scare Digital, which has grown to be the world’s 

second-largest computer maker by poaching customers of IBM’s 

mid-range machines. 

Let’s consider the meanings of scare and poach in this example. The verb 
scare in WordNet has two closely related senses: to cause fear in, and to 
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cause to lose courage. Although it might be interesting to consider which 
of these senses is the right one for this example, its even more interesting to 
consider what it would mean for a corporation to lose courage, or even to 
have it in the first place. For this sentence to make sense, it would appear to 
be the case that corporations must be able to experience emotions like fear 
or courage. Of course, they don’t but we certainly speak of them and often 
reason about them as if they do. 

The verb poach in WordNet has a cooking by boiling sense, and a il¬ 
legal taking of game sense. Intuitively, the use of poach in this example is 
closer to the illegal taking meaning than the boiling one. Of course, this is 
clearly not a simple instance of this use; the poaching involved is not illegal, 
and we can only hope that the poached things are not being killed. In this 
case, the customers are being viewed as a kind of property belonging to the 
company they do business with; and when they choose to do business with 
another company they have been stolen. 

This ability to talk about, and reason about, concepts in terms of other 
metaphor distinct kinds of concepts is called metaphor and is pervasive in all lan¬ 
guages. As a generative model, it is responsible for a large proportion of 
the polysemy in the language, including many of the senses that are listed in 
dictionaries as well as the more novel ones that are not. 

Let’s now consider the following example from the WSJ. 

(16.55) GM killed the Fiero because it had dedicated a full-scale factory 

to... 

The use of kill in this example roughly means to put an end to some kind 
of ongoing effort, or activity. In this case, the ongoing activity of building, 
marketing, and selling a particular kind of car. The metaphor underlying this 
use views activities as living things, allowing the termination to be viewed 
as a killing. Note, however, that this sentence does not say any of this. In 
particular, the patient of the killing is a definite reference the Fiero . For 
the metaphor to make sense, this phrase must refer not to a particular car, but 
rather to an entire sales and production effort at GM. At a very high level, 
this is a case where the result of an entire effort, or process, is being used 
to refer to the process itself. This is an example of metonymy, referring 
to a concept by mentioning a concept closely related to it. Like metaphor, 
metonymy is pervasive and goes mostly unnoticed in natural settings. 
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16.5 Summary 

This chapter has covered a wide range of issues concerning the meanings 
associated with lexical items. The following are among the highlights: 

• Lexical semantics is the study of the systematic meaning-related con¬ 
nections among lexemes, and the internal meaning-related structure of 
individual lexemes. 

• Homonymy refers to lexemes with the same form but unrelated mean¬ 
ings. 

• Polysemy refers to the notion of a single lexeme with multiple related 
meanings. 

• Synonymy holds between different lexemes with the same meaning. 

• Hyponomy relations hold between lexemes that are in class-inclusion 
relationship. 

• Semantic fields are used to capture semantic connections among groups 
of lexemes drawn from a single domain. 

• WordNet is a large database of lexical relations for English words. 

• Thematic roles abstract away from the specifics of deep semantic roles 
by generalizing over similar roles across classes of verbs. 

• Semantic selection restrictions allow lexemes to post constraints on the 
semantic properties of the constituents that accompany them in sen¬ 
tences. 

• Primitive decomposition allows permits the representation of the mean¬ 
ings of individual lexemes in terms of finite sets of sub-lexical primi¬ 
tives. 

• Generative devices such as metaphor and metonymy are pervasive, and 
produce novel meanings that can not in principle be captured in a static 
lexicon. 


Bibliographical and Historical Notes 

Lyons (1977) and Cruse (1986) are classic linguistics texts on lexical seman¬ 
tics. Collections describing computational work on lexical semantics can be 
found in (Pustejovsky and Bergler, 1992; Saint-Dizier and Yiegas, 1995; 
Klavans, 1995). 
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Martin (1986) and Copestake and Briscoe (1995) discuss computa¬ 
tional approaches to the representation of polysemy. The most compre¬ 
hensive collection of work concerning WordNet can be found in (Fellbaum, 
1998). There have been many efforts to use existing dictionaries as lexical 
resources. One of the earliest was Amsler’s (1980, 1981) use of the Mer- 
riam Webster dictionary. More recently, the machine readable version of 
Longman’s Dictionary of Contempory English has been used in a number of 
systems (Boguraev and Briscoe, 1989). 

Thematic roles, or case roles, can be traced back to work by Fillmore 
(1968) and and (Gruber, 1965b). Fillmore’s work had an enormous and im¬ 
mediate impact on work in natural language processing. For a considerable 
period of time, nearly all work in natural language understanding used some 
version of Fillmore’s case roles. Much of the early work in this vein was due 
to Simmons (1973b, 1978, 1983). 

Work on selection restrictions as a way of characterizing semantic 
well-formedness began with (Katz and Fodor, 1963). McCawley (1968) was 
the first to point out that selection restrictions could not be restricted to a 
finite list of semantic features, but had to be drawn from a larger base of 
unrestricted world knowledge 

Lehrer (1974) is a classic text on semantic fields. More recent papers 
addressing this topic can be found in (Lehrer and Kittay, 1992). Baker et al 
(1998) describe ongoing work on the FrameNet project. 

The use of primitives, components, and features to define lexical items 
is ancient. Nida (1975) presents a comprehensive overview of work on com- 
ponential analysis. Wierzbecka (Wierzbicka, 1996) has long been a major 
advocate of the use of primitives in linguistic semantics. Another promi¬ 
nent effort has been Jackendoff’s Conceptual Semantics (Jackendoff, 1983a, 
1990) work which combines thematic roles and primitive decomposition. On 
the computational side, Schank’s Conceptual Dependency Schank (1972) re¬ 
mains the most widely used set of primitives in natural language processing. 
Wilks (1975a) was an early promoter of the use of primitives in machine 
translation, as well natural language understanding in general. More re¬ 
cently, Dorr (1993, 1992) has made considerable computational use of Jack¬ 
endoff’s framework in her work on machine translation. 

An influential collection of papers on metaphor can be found in (Ortony, 
1993). Lakoff and Johnson (1980) is the classic work on conceptual metaphor 
and metonymy. Pustejovsky (1995) introduced the notion of the Gener¬ 
ative Lexicon , a conceptual framework that rejects the notion of the lexi¬ 
con as a static repository in favor of a more dynamic view. Russell (1976) 



Section 16.5. Summary 


625 


presents one of the earliest computational approach to metaphor. Additional 
early work can be found in (DeJong and Waltz, 1983; Wilks, 1978; Hobbs, 
1979b). More recent computational efforts to analyze metaphor can be found 
in (Fass, 1988, 1991; Martin, 1990; Veale and Keane, 1992; Iverson and 
Helmreich, 1992; Chandler, 1991). Martin (1996) presents a survey of com¬ 
putational approaches to metaphor and other types of figurative language. 


Exercises 

16.1 Collect three definitions of ordinary non-technical English words from 
a dictionary of your choice that you feel are flawed in some way. Explain 
the nature of the flaw and how it might be remedied. 

16.2 Download and install the current version of WordNet. 

16.3 Give a detailed account of similarities and differences among the fol¬ 
lowing set of lexemes: imitation , synthetic , artificial, fake and simulated. 

Examine the entries for these lexemes in WordNet (or some dictionary 
of your choice). How well does it reflect your analysis? 

16.4 Consider the following examples from (McCawley, 1968). 

My neighbor is a father of three. 

?My buxom neighbor is a father of three. 

What does the ill-formedness of the second example imply about how 
constituents satisfy, or violate, selection restrictions? 

16.5 Find some articles about business, sports, or politics from your daily 
newspaper. Identify as many lexical metaphors and metonymies as you can 
in these articles. How many of these uses have reasonably close entries in 
either WordNet or your favorite dictionary? 

16.6 [more to come ] 





WORD SENSE 
DISAMBIGUATION AND 
INFORMATION 
RETRIEVAL 


Oh are you from Wales? 

Do you know a fella named Jonah? 
He used to live in whales for a while . 
Groucho Marx 


This chapter introduces a number of topics related to lexical semantic 
processing. By this, we have in mind applications that make use of word 
meanings, but which are to varying degrees decoupled from the more com¬ 
plex tasks of compositional sentence analysis and discourse understanding. 

The first topic we cover, word sense disambiguation, is of consider¬ 
able theoretical and practical interest. As we noted in Chapter 16, the task of 
word sense disambiguation is to examine word tokens in context and spec¬ 
ify which sense of each word is being used. As we will see in the next two 
sections, making this vague definition operational is a non-trivial — there is 
no clear consensus as to exactly what the task is, or how it should be evalu¬ 
ated. Nevertheless, there are robust algorithms that can achieve high levels 
of accuracy under certain reasonable assumptions. 

The second topic we cover, information retrieval, is an extremely 
broad field, encompassing a wide-range of topics pertaining to the storage, 
analysis, and retrieval of all manner of media (Baeza-Yates and Ribeiro- 
Neto, 1999). Our concern in this chapter is solely with the storage and re¬ 
trieval of text documents in response to users requests for information. We 
are interested in approaches in which users’ needs are expressed as words, 
and documents are represented in terms of the words they contain. Section 
17.3 presents the vector space model, a well-established approach used in 
most current systems, including most Web search engines. 


SEMANTIC 

PROCESSING 


WORD SENSE 
DISAMBIGUA¬ 
TION 


INFORMATION 

RETRIEVAL 




628 


Chapter 17. Word Sense Disambiguation and Information Retrieval 


17.1 Selection Restriction-Based Disambiguation 

For the most part, our discussions of compositional semantic analyzers in 
Chapter 15 ignored the issue of lexical ambiguity. By now it should be clear 
that this is not a reasonable approach. Without some means of selecting cor¬ 
rect senses for the words in the input, the enormous amount of homonymy 
and polysemy in the lexicon will quickly overwhelm any approach in an 
avalanche of competing interpretations. As with syntactic part-of-speech 
tagging, there are two fundamental approaches to handling this ambiguity 
problem. In the first approach, the selection of correct senses occurs during 
semantic analysis as a side-effect of the elimination of ill-formed represen¬ 
tations composed from an incorrect combination of senses. In the second 
approach, sense disambiguation is performed as a stand-alone task indepen¬ 
dent of, and prior to, compositional semantic analysis. This section discusses 
the role of selection restrictions in the former approach. The stand-alone ap¬ 
proach is discussed in detail in 17.2. 

Selection restrictions and type hierarchies are the primary knowledge- 
sources used to perform disambiguation in most integrated approaches. In 
particular, they are used to rule out inappropriate senses and thereby reduce 
the amount of ambiguity present during semantic analysis. If we assume 
an integrated rule-to-rule approach to semantic analysis, then selection re¬ 
strictions can be used to block the formation of component meaning repre¬ 
sentations that contain violations. By blocking such ill-formed components, 
the semantic analyzer will find itself dealing with fewer ambiguous meaning 
representations. This ability to focus on correct senses by eliminating flawed 
representations that result from incorrect senses can be viewed as a form of 
indirect word sense disambiguation. While the linguistic basis for this ap¬ 
proach can be traced back to the work of Katz and Fodor (1963), the most 
sophisticated computational exploration of it is due to Hirst (1987). 

As an example of this approach, consider the following pair of WSJ 
examples, focusing solely on their use of the lexeme dish. 

(17.1) “In our house, everybody has a career and none of them includes 
washing dishes”, he says. 

(17.2) In her tiny kitchen at home, Ms. Chen works efficiently, stir-frying 
several simple dishes, including braised pig’s ears and chicken livers 
with green peppers. 

These examples make use of two polysemous senses of the lexeme dish. The 
first refers to the physical objects that we eat from, while the second refers to 
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the actual meals or recipes. The fact that we perceive no ambiguity in these 
examples can be attributed to the selection restrictions imposed by wash and 
stir-fry on their patient roles, along with the semantic type information 
associated with the two senses of dish. More specifically, the restrictions 
imposed by wash conflict with the food sense of dish since it does not denote 
something that is normally washable. Similarly, the restrictions on stir-fry 
conflict with the artifact sense of dish, since it does not denote something 
edible. Therefore, in both of these cases the predicate selects the correct 
sense of an ambiguous argument by eliminating the sense that fails to match 
one of its selection restrictions. 

Now consider the following WSJ and ATIS examples, focusing on the 
ambiguous predicate serve. 

(17.3) Well, there was the time they served green-lipped mussels from New 
Zealand. 

(17.4) Which airlines serve Denver? 

(17.5) Which ones serve breakfast? 

Here the sense of serve in 17.3 requires some kind of food as its patient, 
the sense in 17.4 requires some kind of geographical or political entity, and 
the sense in the last example requires a meal designator. If we assume that 
mussels , Denver and breakfast are unambiguous, then in it is the arguments 
in these examples that select the appropriate sense of the verb. 

Of course, there are also cases where both the predicate and the argu¬ 
ment have multiple senses. Consider the following BERP example. 

(17.6) I’m looking for a restaurant that serves vegetarian dishes. 

Restricting ourselves to three senses of serve and two senses of dish yields 
six possible sense combinations in this example. However, since only one 
combination of the six is free from a selection restriction violation, determin¬ 
ing the correct sense of both serve and dish is straightforward. In particular, 
the predicate and argument mutually select the correct senses. 

Before moving on, we should note there will always be examples like 
the following where the available selection restrictions are too general to 
uniquely select a correct sense. 

(17.7) What kind of dishes do you recommend? 

In cases like this we either have to rely on the stand-alone methods discussed 
in 17.2, or knowledge of the broader discourse context, as will be discussed 
in Chapter 18. 
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Although there are a wide variety of ways to integrate this style of 
disambiguation into a semantic analyzer, the most straightforward approach 
follows the rule-to-rule strategy introduced in Chapter 15. In this integrated 
approach, fragments of meaning representations are composed and checked 
for selection restriction violations as soon as their corresponding syntactic 
constituents are created. Those representations that contain selection restric¬ 
tion violations are eliminated from further consideration. 

This approach requires two additions to the knowledge structures used 
in our semantic analyzers: access to hierarchical type information about the 
arguments, and semantic selection restriction information about the argu¬ 
ments to predicates . Recall from Chapter 16, that both of these can be 
encoded using knowledge from WordNet. The first is available in form of 
the hypemym information about the heads of the meaning structures being 
used as arguments to predicates. Similarly, selection restriction information 
about argument roles can be encoded by associating the appropriate WordNet 
synsets with the arguments to each predicate-bearing lexical item. Exercise 
?? asks you to explore this approach in more detail. 

Limitations of Selection Restrictions 

Not surprisingly, there are a number of practical and theoretical problems 
with this use of selection restrictions. The first symptom of these problems 
is the fact that there are many perfectly well-formed, interpretable, sentences 
that contain obvious violations of selection restrictions. Therefore, any ap¬ 
proach based on a strict elimination of such interpretations is in serious trou¬ 
ble. 

Consider the following WSJ example. 

(17.8) But it fell apart in 1931, perhaps because people realized you can’t 

eat gold for lunch if you’re hungry. 

The phrase eat gold clearly violates the selection restriction that eat places 
on its patient role. Nevertheless, this example is perfectly well-formed. 
The key is the negative environment set up by can Y prior to the violation of 
the restriction. This example makes it clear that any purely local, or rule-to- 
rule, analysis of selection restrictions will fail when a wider context makes 
the violation of a selection restriction acceptable, as in this case. 

A second problem with selection restrictions is illustrated by the fol¬ 
lowing example. 
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(17.9) In his two championship trials, Mr. Kulkami ate glass on an empty 

stomach, accompanied only by water and tea. 

Although the event described in this example is somewhat unusual, the sen¬ 
tence itself is not semantically ill-formed, despite the violation of eat s selec¬ 
tion restriction. Examples such as this illustrate the fact that thematic roles 
and selection restrictions are merely loose approximations of the deeper con¬ 
cepts they represent. They can not hope to account for uses such as this that 
require deeper commonsense knowledge about what eating is all about. At 
best, they reflect the idea that the things that are eaten are normally edible. 

Finally, as discussed in Chapter 16, metaphoric and metonymic uses 
challenge this approach as well. Consider the following WSJ example. 

(17.10) If you want to kill the Soviet Union, get it to try to eat Afghanistan. 

Here the typical selection restrictions on the patients of both kill and eat 
will eliminate all possible literal senses leaving the system with no possible 
meanings. In many systems, such a situation serves to trigger alternative 
mechanisms for interpreting metaphor and metonymy (Fass, 1997). 

As Hirst (1987) observes, examples like these often result in the elim¬ 
ination of all senses, bring semantic analysis to a halt. One approach to 
alleviating this problem is to adopt the view of selection restrictions as pref¬ 
erences, rather than rigid requirements. Although there have been many 
instantiations of this approach over the years (Wilks, 1975c, 1975b, 1978), 
the one that has received the most thorough empirical evaluation is Resnik’s 
(1998) work, which uses the notion of a selectional association introduced 
on page ??. Recall that this notion uses an empirically derived measure of 
the strength of association between a predicate and a class dominating the 
argument to the predicate. 

A simplified version of Resnik’s disambiguation algorithm is shown in 
Figure 17.1. The basic notion behind this algorithm is to select as the correct 
sense for the argument, the one that has the highest selectional association 
between one of its ancestor hypemyms and the predicate. Resnik (1998) re¬ 
ports an average of 44% correct with this technique for verb-object relation¬ 
ships, a result that is an improvement over a most frequent sense baseline. 
A limitation of this approach is that it only addresses the case where the 
predicate is unambiguous and selects the correct sense of the argument. A 
more complex decision criteria would be needed for the more likely situation 
where both the predicate and argument are ambiguous. 
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function SA-WSD(prai, arg) returns sense 

best-association <-Mininum possible selection association 
for each sense in senses of arg do 
for each hypernym in hypemyms of sense do 

new<- Selectional association between hyp and pred 
if new > best-association then 
best-association <- new 
best-sense «— sense 

end 

end 

return best-sense 


Figure 17.1 Resnik’s (1998) selectional association-based word sense dis¬ 
ambiguation algorithm. The selection association between all the hypemyms 
of all the senses of the target argument and the predicate are computed. The 
sense with the most closely associated hypemym is selected. 


17.2 Robust Word Sense Disambiguation 


The selection restriction approach to disambiguation has too many require¬ 
ments to be useful in large-scale practical applications. Even with the use 
of WordNet, the requirements of complete selection restriction information 
for all predicate roles, and complete type information for the senses of all 
possible fillers are unlikely to be met. In addition, as we saw in Chapters 10, 
12, and 15, the availability of a complete and accurate parse for all inputs is 
unlikely to be met in environments involving unrestricted text. 

To address these concerns, a number of robust disambiguation systems 
with more modest requirements have been developed over the years. As 
with part-of-speech taggers, these systems are designed to operate in a stand¬ 
alone fashion and make minimal assumptions about what information will be 
available from other processes. 

Machine Learning Approaches 

In machine learning approaches, systems are trained to perform the task 
of word sense disambiguation. In these approaches, what is learned is a 
classifier that can be used to assign as yet unseen examples to one of a fixed 
number of senses. As we will see, these approaches vary as to the nature 
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of the training material, how much material is needed, the degree of human 
intervention, the kind of linguistic knowledge used, and the output produced. 

What they all share is an emphasis on acquiring the knowledge needed for 
the task from data, rather than from human analysts. The principal question 
to keep in mind as we explore these systems is whether the method scales; 
that is, would it be possible to apply the method to a substantial part of the 
entire vocabulary of a language? 

The Inputs: Feature Vectors 

Before discussing the algorithms, we should first characterize the kind of 
inputs they expect. In most of these approaches, the initial input consists of 
the word to be disambiguated, which we will refer to as the target word, 
along with a portion of the text in which it is embedded, which we will call 
its context. This initial input is then processed in the following ways: 

• The input is normally part-of-speech tagged using one of the high ac¬ 
curacy methods described in Chapter 8. 

• The original context may be replaced with larger or smaller segments 
surrounding the target word. 

• Often some amount of stemming, or more sophisticated morphological 
processing, is performed. 

• Less often, some form of partial parsing, or dependency parsing, is 
performed to ascertain thematic or grammatical roles and relations. 

After this initial processing, the input is then boiled down to a fixed set 
of features that capture information relevant to the learning task. This task 
consists of two steps: selecting the relevant linguistic features, and encoding 
them in a form usable in a learning algorithm. Fortunately, a simple feature 
vector consisting of numeric or nominal values can easily encode the most vector e 
frequently used linguistic information, and is appropriate for use in most 
learning algorithms 

The linguistic features used in training WSD systems can be roughly 
divided into two classes: collocational features and co-occurrence features. 

In general, the term collocation refers to a quantifiable position-specific re- collocation 
lationship between two lexical items. Collocational features encode infor¬ 
mation about the lexical inhabitants of specific positions located to the left 
and right of the target word. Typical items in this category include the word, 
the root form of the word, and the word’s part-of-speech. This type of fea¬ 
ture is effective at encoding local lexical and grammatical information that 
can often accurately isolate a given sense. 
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As an example of this type of feature-encoding, consider the situation 
where we need to disambiguate the lexeme bass in the following example. 

(17.11) An electric guitar and bass player stand off to one side, not really 
part of the scene, just as a sort of nod to gringo expectations perhaps. 

A feature-vector consisting of the two words to the right and left of the target 
word, along with their respective parts-of-speech, would yield the following 
vector. 

[guitar, NN1, and, CJC, player, NN1, stand, VVB] 

The second type of feature consists of co-occurrence data about neigh¬ 
boring words, ignoring their exact position. In this approach, the words 
themselves (or their roots) serve as features. The value of the feature is the 
number of times the word occurs in a region surrounding the target word. 
This region is most often defined as a fixed size window with the target word 
at the center. To make this approach manageable, a small number of fre¬ 
quently used content words are selected for use as features. This kind of 
feature is effective at capturing the general topic of the discourse in which 
the target word has occurred. This, in turn, tends to identify senses of a word 
that are specific to certain domains. 

For example, a co-occurrence vector consisting of the 12 most frequent 
content words from a collection of bass sentences drawn from the WSJ cor¬ 
pus would have the words as features: fishing, big, sound , player, fly, rod, 
pound, double, runs, playing, guitar, band. Using these words as features 
with a window size of 10, Example 17.11 would be represented by the fol¬ 
lowing vector. 


[ 0 , 0 , 0 , 1 , 0 , 0 , 0 , 0 , 0 , 0 , 1 , 0 ] 

As we will see, most robust approaches to sense disambiguation make 
use of a combination of both collocational and co-occurrence features. 

Supervised Learning Approaches 

In supervised approaches, a sense disambiguation system is learned from a 
representative set of labeled instances drawn from the same distribution as 
the test set to be used. This is a straightforward application of the supervised 
learning approach to creating a classifier. In such approaches, a learning 
system is presented with a training set consisting of feature-encoded inputs 
along with their appropriate label, or category. The output of the system is a 
classifier system capable of assigning labels to new feature-encoded inputs. 
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Methodology Box: Evaluating WSD Systems 


The basic metric used in evaluating sense disambiguation sys¬ 
tems is simple precision: the percentage of words that are tagged 
correctly. The primary baseline against which this metric is com¬ 
pared is the most frequent sense metric: how well would a system 
do if it simply chose the most frequent sense of a word. 

The use of precision requires access to the correct answers to the 
words in a test set. Fortunately, two large sense-tagged corpora are 
now available: the semcor corpus (Landes etai, 1998), which con¬ 
sists of a portion of the Brown corpus tagged with WordNet senses, 
and the senseval corpus (Kilgarriff and Roscnzwcig, 2000), which 
is a tagged corpus derived from the HECTOR corpus and dictionary 
project. 

A number of issues must be taken into account in comparing 
results across systems. The main issue concerns the nature of the 
senses used in the evaluation. Two approaches have been followed 
over the years: coarse distinctions among homographs, such as the 
musical and fish senses of bass, and fine-grained sense distinctions 
such as those found in traditional dictionaries. Unfortunately, there 
is no standard way of comparing results across these two kinds of 
efforts, or across efforts using different dictionaries. 

Dictionary senses provide the opportunity for a more fine¬ 
grained scoring metric than simple precision. For example, con¬ 
fusing a particular musical sense of bass with a fish sense, is clearly 
worse than confusing it with another musical sense. This observa¬ 
tion gives rise to a notion of partial credit in evaluating these sys¬ 
tems. With such a metric, an exact sense-match would receive full 
credit, while selecting a broader sense would receive partial credit. 
Of course, this kind of scheme is entirely dependent on the organi¬ 
zation of senses in the particular dictionary being used. 

Standardized evaluation frameworks for word sense disam¬ 
biguation systems are now available. In particular, the SENSEVAL 
effort (Kilgarriff and Palmer, 2000), provides the same kind of eval¬ 
uation framework for sense disambiguation, that the MUC (Sund- 
heim, 1995b) and TREC (Voorhees and Hannan, 1998) evaluations 
have provided for information extraction and information retrieval. 
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Bayesian classifiers (Duda and Hart, 1973), decision lists (Rivest, 1987), 
decision trees (Quinlan, 1986), neural networks (Rumelhart et al , 1986), 
logic learning systems (Mooney, 1995), and nearest neighbor methods(Cover 
and Hart, 1967) all fit into this paradigm. We will restrict our discussion to 
the naive Bayes and decision list approaches, since they have been the focus 
of considerable work in word sense disambiguation. 
naive bayes The naive Bayes classifier approach to WSD is based on the premise 

that choosing the best sense for an input vector amounts to choosing the most 
probable sense given that vector. In other words: 


s = argmax P(s|V) 

ses 


(17.12) 


In this formula, S denotes the set of senses appropriate for the target associ¬ 
ated with this vector. As is almost always the case, it would be difficult to 
collect statistics for this equation directly. Instead, we rewrite it in the usual 
Bayesian manner as follows: 


S = argmax 

s£S 


P{V\s)P{s) 

P(V) 


(17.13) 


Of course, the data available that associates specific vectors with senses 
is too sparse to be useful. What is provided in abundance in the training set 
is information about individual feature-value pairs in the context of specific 
senses. Therefore, we can make the same independence assumption that 
has served us well in part-of-speech tagging, speech recognition, and prob¬ 
abilistic parsing — assume that the features are independent of one another. 
Making this assumption yields the following equation. 

P(V\s) = f\P(vj\s) (17.14) 

7=1 

Given this equation, training a Naive Bayes classifier amounts to col¬ 
lecting counts of the individual feature-value statistics with respect to each 
sense of the target word. The term P(s) is the prior for each sense, which just 
corresponds to the proportion of each sense in the training set. Finally, since 
P(V) is the same for all possible senses it does not effect the final ranking of 
senses, leaving us with the following. 

n 

s = argmax P(s) (17.15) 

sinS j—\ 

Of course, all the issues discussed in Chapter 8 with respect to zero counts 
and smoothing apply here as well. 
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Rule Sense 


fish within window 


bass 

striped bass 


bass 

guitar within window 


bass 

bass player 


bass 

piano within window 


bass 

tenor within window 


bass 

sea bass 


bass 

play/W bass 


bass 

river within window 


bass 

violin within window 


bass 

salmon within window 


bass 

on bass 


bass 

bass are 


bass 


Figure 17.2 An abbreviated decision list for disambiguating the fish sense 
of bass from the music sense. (Adapted from (Yarowsky, 1996)) 


In a large experiment evaluating a number of supervised learning al¬ 
gorithms, Mooney (1996) reports that a naive-Bayes classifier and a neural 
network achieved the highest performance, both achieving around 73% cor¬ 
rect in assigning one of 6 senses to a corpus of examples of the word line. 

Decision list classifiers can be viewed as a simplified variant of deci¬ 
sion trees. In a decision list classifier, a sequence of tests is applied to each 
vector encoded input. If a test succeeds, then the sense associated with that 
test is applied to the input and returned. If the test fails, then the next test 
in the sequence is applied. This continues until the end of the list, where a 
default test simply returns the majority sense. Figure 17.2 shows a portion 
of a decision list for the task of discriminating the fish sense of bass from the 
music sense. 

Learning a decision list classifier consists of creating a good sequence 
of tests based on the characteristics of the training data. There are wide 
number of methods that can be used to create such lists. Yarowsky (1994) 
employs an extremely simple technique that yields excellent results in this 
domain. In this approach, all possible feature-value pairs are used to create 
tests. These individual tests are then ordered according to their individual 
accuracy on the training set, where the accuracy of a test is based on its 


DECISION 

LIST 

CLASSIFIERS 
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log-likelihood ratio: 


Abs(Log 


( 


P(Sensei\fj = Vj) 
P(Sense 2 \fi = vj) 


)> 


(17.16) 


The decision list is created from these tests by simplying ordering the tests 
in the list according to this measure, with each test returning the appropriate 
sense. Yarowsky (1996) reports that this technique consistently achieves 
over 95% correct on a wide variety of binary decision tasks. 

We should note that this training method differs quite a bit from the 
standard decision list learning algorithm. For the details and theoretical mo¬ 
tivation for that approach see (Rivest, 1987; Russell and Norvig, 1995). 


Bootstrapping Approaches 

Not surprisingly, a major problem with supervised approaches is the need 
for a large sense-tagged training set. The bootstrapping approach (Hearst, 
1991; Yarowsky, 1995) eliminates the need for a large training set by relying 
on a relatively small number of instances of each sense for each lexeme of 
interest. These labeled instances are used as seeds to train an initial classifier 
using any of the supervised learning methods mentioned in the last section. 
This initial classifier is then be used to extract a larger training set from 
the remaining untagged corpus. Repeating this process results in a series of 
classifiers with improving accuracy and coverage. 

The key to this approach lies in its ability to create a larger training set 
from a small set of seeds. To succeed, it must include only those instances 
in which the initial classifier has a high degree of confidence. This larger 
training set is then used to create a new more accurate classifier with broader 
coverage. With each iteration of this process, the training corpus grows and 
the untagged corpus shrinks. As with most iterative methods, this process 
can be repeated until some sufficiently low error-rate on the training set is 
reached, or until no further examples from the untagged corpus are above 
threshold. 

The initial seed set used in these bootstrapping methods can be gen¬ 
erated in a number of ways. Hearst (1991) generates a seed set by hand 
labeling a small set of examples from the initial corpus. This approach has 
three major advantages: 

• There is a reasonable certainty that the seed instances are correct, thus 
ensuring that the learner does not get off on the wrong foot 

• The analyst can make some effort to choose examples that are not only 
correct, but in some sense prototypical of each sense. 
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Klucevsek plays Giulietti or Titano piano accordions with the more flexible, more 
difficult free bass rather than the traditional Stradella bass with its preset chords 
designed mainly for accompaniment. 

We need more good teachers - right now, there are only a half a dozen who can 
play the free bass with ease. 

An electric guitar and bass player stand off to one side, not really part of the 
scene, just as a sort of nod to gringo expectations perhaps. 

When the New Jersey Jazz Society, in a fund-raiser for the American Jazz Hall of 
Fame, honors this historic night next Saturday, Harry Goodman, Mr. Goodman’s 
brother and bass player at the original concert, will be in the audience with other 
family members. 

The researchers said the worms spend part of their life cycle in such fish as Pacific 
salmon and striped bass and Pacific rockfish or snapper. 

Associates describe Mr. Whitacre as a quiet, disciplined and assertive manager 
whose favorite form of escape is bass fishing. 

And it all started when fishermen decided the striped bass in Lake Mead were too 
skinny. 

Though still a far cry from the lake’s record 52-pound bass of a decade ago, ’’you 
could fillet these fish again, and that made people very, very happy,” Mr. Paulson 
says. 

Saturday morning I arise at 8:30 and click on ’’America’s best-known fisherman,” 
giving advice on catching bass in cold weather from the seat of a bass boat in 
Louisiana. 

Figure 17.3 Samples of bass sentences extracted from the WSJ using the 
simple correlates play and fish. 


• It is reasonably easy to carry out. 

A remarkably effective alternative technique is to simply search for 
sentences containing single words that are strongly correlated with the target 
senses. Yarowsky (1995) calls this the One Sense per Collocation constraint 
and presents results that show that it yields remarkably good results. For 
example, Figure 17.3 shows a partial result of a such a search for the strings 
“fish” and “play” in a corpus of bass examples drawn from the WSJ. 

Yarowsky (1995) suggests two methods to select effective correlates: 
deriving them from machine readable dictionary entries, and selecting seeds 
using collocations statistics such as those described in Chapter 6. Putting all 
of this to the test, Yarowsky (1995) reports an average performance of 96.5% 
on a coarse binary sense assignment of 12 words. 



640 


Chapter 17. Word Sense Disambiguation and Information Retrieval 


AGGLOMERA¬ 

TE 

CLUSTERING 


Unsupervised Methods: Discovering Word Senses 

Unsupervised approaches to sense disambiguation eschew the use of sense 
tagged data of any kind during training. In these approaches, feature-vector 
representations of unlabeled instances are taken as input and are then grouped 
into clusters according to a similarity metric. These clusters can then be rep¬ 
resented as the average of their constituent feature-vectors, and labeled by 
hand with known word senses. Unseen feature-encoded instances can be 
classified by assigning them the word sense from the cluster to which they 
are closest according to the similarity metric. 

Fortunately, clustering is a well-studied problem with a wide number 
of standard algorithms that can be applied to inputs structured as vectors of 
numerical values (Duda and Hart, 1973). The most frequently used tech¬ 
nique in language applications is known as agglomerative clustering. In 
this technique, each of the N training instances is initially assigned to its 
own cluster. New clusters are then formed in a bottom-up fashion by succes¬ 
sively merging the two clusters that are most similar. This process continues 
until a either a specified number of clusters is reached, or some global good¬ 
ness measure among the clusters is achieved. In cases where the number of 
training instances makes this method too expensive, random sampling can 
be used on the original training set (Cutting et al ., 1992b) to achieve similar 
results. 

Of course, the fact that these unsupervised methods do not make use 
of hand-labeled data poses a number of challenges for evaluating the good¬ 
ness of any clustering result. The following problems are among the most 
important ones that have to be addressed in unsupervised approaches. 

• The correct senses of the instances used in the training data may not be 
known. 

• The clusters are almost certainly heterogeneous with respect to the 
senses of the training instances contained within them. 

• The number of clusters is almost always different from the number of 
senses of the target word being disambiguated. 

Schiitze’s experiments (Schiitze, 1992, 1998) constitute the most ex¬ 
tensive application of unsupervised clustering to word sense disambiguation 
to date. Although the actual technique is quite involved, unsupervised ag¬ 
glomerative clustering is at the core of the method. As with the supervised 
approaches, the bulk of this work is directed at coarse binary distinctions. In 
this work, the first two problems are addressed through the use of pseudo¬ 
words and a hand-labeling of a small subset of the instances in each cluster. 
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The heterogeneity issue is addressed by assigning the majority sense to each 
of the induced clusters. Given this approach, the last problem is not an is¬ 
sue; the various discovered clusters are simply labeled with their majority 
sense. The fact that there may be multiple clusters with the same sense is not 
directly an issue in disambiguation. 

Schiitze’s results indicate that for coarse binary distinctions, unsuper¬ 
vised techniques can achieve results approaching those of supervised and 
bootstrap methods. In most instances approaching the 90% range. As with 
most of the supervised methods, this method was tested on a small sample 
of words (10 pseudowords, and 10 real words). 

Dictionary-Based Approaches 

A major drawback with all of the approaches described above is the problem 
of scale. All require a considerable amount of work to create a classifier for 
each ambiguous entry in the lexicon. For this reason, most of the experi¬ 
ments with these methods report results ranging from 2 to 12 lexical items 
(The work of Ng and Lee (1996) is a notable exception reporting results dis¬ 
ambiguating 121 nouns and 70 verbs). Scaling up any of these approaches to 
deal with all the ambiguous words in a language would be a large undertak¬ 
ing. Instead, attempts to perform large-scale disambiguation have focused on 
the use of machine readable dictionaries, of the kind discussed in Chap¬ 
ter 16. In this style of approach, the dictionary provides both the means for 
constructing a sense tagger, and the target senses to be used. 

The first implementation of this approach is due to Lesk (1986). In 
this approach, all the sense definitions of the word to be disambiguated are 
retrieved from the dictionary. These senses are then compared to the dictio¬ 
nary definitions of all the remaining words in the context. The sense with the 
highest overlap with these context words is chosen as the correct sense. Note 
that the various sense definitions of the context words are simply lumped to¬ 
gether in this approach. Lesk reports accuracies of 50-70% on short samples 
of text selected from Austen’s Pride and Prejudice and an AP newswire ar¬ 
ticle. 

The problem with this approach is that dictionary entries for the vari¬ 
ous senses of target words are relatively short, and may not provide sufficient 
material to create adequate classifiers. 1 More specifically, the words used 
in the context and their definitions must have direct overlap with the words 

L Indeed, Lesk (Lesk, 1986) notes that the performance of his system seems to roughly 
correlate with the length of the dictionary entries. 
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contained in the appropriate sense definition in order to be useful. One way 
to remedy this problem is to expand the list of words used in the classifier 
to include words related to, but not contained in their individual sense def¬ 
initions. This can be accomplished by including words whose definitions 
make use of the target word. For example, the word deposit does not oc¬ 
cur in the definition of bank in the American Heritage Dictionary (Morris, 
1985). However, bank does occur in the definition of deposit. Therefore, the 
classifier for bank can be expanded to include deposit as a relevant feature. 

Of course, just knowing that deposit is related to bank does not help 
much since we don’t know to which of bank's senses it is related. Specifi¬ 
cally, to make use of deposit as a feature we have to know which sense of 
bank was being used in its definition. Fortunately, many dictionaries and 
thesauri include tags known as subject codes in their entries that correspond 
roughly to broad conceptual categories. For example, the entry for bank 
in the Longman's Dictionary of Contemporary English (LDOCE) (Procter, 
1978) includes the subject code EC (Economics) for the financial senses of 
bank. Given such subject codes, we can guess that expanded terms with 
the subject code EC will be related to this sense of bank rather than any of 
the others. Guthrie et al. (1991) report results ranging of 47% correct for 
fine-grained LDOCE distinctions to 72% for more coarse distinctions. 

Note that none of these techniques actually exploit the dictionary en¬ 
tries as definitions. Rather, they can be viewed as variants of the supervised 
learning approach, where the content of the dictionary is used to provide the 
tagged training materials. 


Information Retrieval 

The field of information retrieval is of interest to us here due to its widespread 
adoption of word-based indexing and retrieval methods. Most current infor¬ 
mation retrieval systems are based on an extreme interpretation of the princi¬ 
ple of compositional semantics. In these systems, the meaning of documents 
resides solely in the words that are contained within them. To revisit the 
Mad Hatter’s quote from the beginning of Chapter 16, in these systems I see 
what I eat and I eat what I see mean precisely the same thing. The order¬ 
ing and constituency of the words that make up the sentences that make up 
documents play no role in determining their meaning. Because they ignore 
syntactic information, these approaches are often referred to as bag of words 
methods. 
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Before moving on, we need to introduce some new terminology. In in¬ 
formation retrieval, a document refers generically to the unit of text indexed document 
in the system and available for retrieval. Depending on the application, a 
document can refer to anything from intuitive notions like newspaper arti¬ 
cles, or encyclopedia entries, to smaller units such as paragraphs and sen¬ 
tences. In Web-based applications, it can refer to a Web page, a part of a 
page, or to an entire Web-site. A collection refers to a set of documents be- collection 
ing used to satisfy user requests. A term refers to a lexical item that occurs term 
in a collection, but it may also include phrases. Finally, a query represents a query 
user’s information need expressed as a set of terms. 

The specific information retrieval task that we will consider in detail is 
known as ad hoc retrieval. In this task, it is assumed that an unaided user retrieval 
poses a query to a retrieval system, which then returns a possibly ordered 
set of potentially useful documents. Several other related, lexically oriented, 
information retrieval tasks will be discussed in Section 17.4. 

The Vector Space Model 

In the vector space model of information retrieval, documents and queries space model 
are represented as vectors of features representing the terms that occur within 
them (Salton, 1971). More properly, they are represented as vectors of fea¬ 
tures consisting of the terms that occur within the collection , with the value 
of each feature indicating the presence or absence of a given term in a given 
document. These vectors can be denoted as follows: 

d = (fi,f2,f3,***;6v) 

Q= (h,f2,f3,***:6v) 

In this notation, the various t features represent the N terms that occur in the 
collection. Let’s first consider the case where these features take on the value 
of one or zero, indicating the presence or absence of a term in a document 
or query. Given this approach, a simple way to compare a document to a 
query, or another document, is to sum up the number of terms they have in 
common, as in the following equation. 

N 

s(qk, dj ) = £ x tij (17.17) 

i= 1 

Of course, a problem with the use of binary values for features is that 
it fails to capture the fact that some terms are more important to the meaning 
of a document than others. A useful generalization is to replace the ones 
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DOCUMENT 


and zeroes with numerical weights that indicate the importance of the vari¬ 
ous terms in particular documents and queries. We can thus generalize our 
vectors as follows: 

dj = {w 1 j,W 2 j,w 3 j,---,w n j) 

qk = (wi t k,w 2 ,k,W3,kr--,w n ,k) 

This characterization of individual documents as vectors of term weights 
allows us to view the document collection as a whole a matrix of weights, 
where Wij represents the weight of term i in document j. This weight ma¬ 
trix is typically called a term-by-document matrix. Under this view, the 
columns of the matrix represent the documents in the collection, and the 
rows represent the terms. 

A useful view of this model conceives of the features used to represent 
documents (and queries) as dimensions in a multi-dimensional space. Corre¬ 
spondingly, the weights that serve as values for those features serve to locate 
documents in that space. When a user’s query is translated into a vector it 
denotes a point in that space. Documents that are located close to the query 
can then be judged as being more relevant than documents that are farther 
away. 

This characterization of documents and queries as vectors, provides all 
the basic parts for an ad hoc retrieval system. A document retrieval system 
can simply accept a user’s query, create a vector representation for it, com¬ 
pare it against the vectors representing all known documents, and sort the 
results. The result is a list of documents rank ordered by their similarity to 
the query. 

Consider as an example of this approach, the space shown in Figure 
17.4. This figure shows a simplified space consisting of the three dimensions 
corresponding to the terms speech , language and processing. The three vec¬ 
tors illustrated in this space represent documents derived from the chapter 
and section headings of Chapters 1, 7, and 13 of this text, which we’ll de¬ 
note as Docl, Doc7, and Docl3, respectively. If we identify term weights 
with raw term frequency, then Docl is represented by the vector (1,2,1), 
Doc7 by (6,0,1), and Docl3 by (0,5,1). As is clear from the figure, this 
space captures certain intuitions about how these chapters are related. Chap¬ 
ter 1, being general, is fairly similar to both Chapters 7 and 13. Chapters 7 
and 13, on the other hand, are distant from one another since they cover a 
different set of topics. 

Unfortunately, this particular instantiation of a vector space places too 
much emphasis on the absolute values of the various coordinates of each 
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document. For example, what is important about the speech dimension of 
the Doc7, is not the value 6 but rather that it is the dominant contributor to 
the meaning of that document. Similarly, the specific values of 1, 2, and 1 
for Docl are not important, what is important is that the three dimensions 
have roughly similar weights. It would be sensible, for example, to assume 
that a new document with weights 3, 6, and 3 would be quite similar to Docl 
despite the magnitude differences in the term weights. 

We can accomplish this effect by normalizing the document vectors. 
By normalizing, we simply mean converting all the vectors to a standard 
length. Converting to a unit length can be accomplished by dividing each 
of their dimensions by the overall length of the vector, which is defined as 
Yd=i w h This, in effect, eliminates the importance of the exact length of a 
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document’s vector in the space, and emphasizes instead the direction of the 
document vector with respect to the origin. 

Applying this technique to our three sample documents results in the 
following term-by-document matrix, A, where the columns represent Docl, 
Doc7 and Docl3 and the rows represent the terms speech , language , and 
processing. 

/ .41 .81 .41 \ 

A= .98 0 .16 

V 0 .98 .19 / 


You should verify that with this scheme, the normalized vectors for Docl 
and our hypothetical (3,6,3) document end up as identical vectors. 

Now let’s return now to the topic of determining the similarity between 
vectors. Updating the similarity metric given earlier with numerical weights 
rather than binary values, gives us the following equation. 

N 

s((jk-,dj) = q k -dj = Y,Wi,k X Wij (17.18) 

i=l 

This equation specifies what is known as the dot product between vectors. 
Now, in general, the dot product between two vectors is not particularly use¬ 
ful as a similarity metric, since it is too sensitive to the absolute magnitudes 
of the various dimensions. However, the dot product between vectors that 
have been normalized has a useful and intuitive interpretation: it computes 
the cosine of the angle between two vectors. When two documents are iden¬ 
tical they will receive a cosine of one; when they are orthogonal (share no 
common terms) they will receive a cosine of zero. 

Note that if for some reason the vectors are not stored in a normalized 
form, then the normalization can be incorporated directly into the similarity 
measure as follows. 


stfkjj) 


E/Ll Wi,k X Wij 



(17.19) 


Of course, in situations where the document collection is relatively static and 
many queries are being performed, it makes sense to normalize the document 
vectors once and store them, rather than include the normalization in the 
similarity metric. 

Let’s consider how this similarity metric would work in the context 
of some small examples. Consider the carefully selected query consisting 
solely of the terms speech , language and processing. Converting this query 
to a vector and normalizing it results in the vector (.57, .57, .57). Computing 



Section 17.3. Information Retrieval 


647 


the cosines between this vector and our three document vectors shows that 
Docl is closest with a cosine of .92, followed by Docl3 with a cosine of 
.67, and finally Doc7 with a cosine of .65. Not surprisingly, this ranking is 
in close accord with our intuitions about the relationship between this query 
and these documents. 

Now consider a shorter query consisting solely of the terms speech and 
processing. Processing this query yields the normalized vector (.70,0, .70). 

When the cosines are computed between this vector and our documents, 

Doc7 is now the closest with a cosine of .80, followed by Docl with a score 
of .58, with Docl3 coming in a distant third with a cosine of .13. 

Term Weighting 

In practice, the method used to assign terms weights in the document and 
query vectors has an enormous impact on the effectiveness of a retrieval 
system. Two factors have proven to be critical in deriving effective term 
weights: term frequency within a single document, and the distribution of 
terms across a collection. We can begin with the simple notion that terms that 
occur frequently within a document may reflect its meaning more strongly 
than terms that occur less frequently and should thus have higher weights. 

In its simplest form, this factor is called term frequency and is simply the frequency 
raw frequency of a term within a document (Luhn, 1957). 

The second factor to consider is the distribution of terms across the col¬ 
lection as a whole. Terms that are limited to a few documents are useful for 
discriminating those documents from the rest of the collection. On the other 
hand, terms that occur frequently across the entire collection are less useful 
in discriminating among documents. What is needed therefore is a measure 
that favors terms that occur in fewer documents. The fraction N/n^ where 
N is the total number of documents in the collection, and n is the number of 
documents in which term i occurs, provides exactly this measure. The fewer 
documents a term occurs in, the higher this weight. The lowest weight of 1 
is assigned to terms that occur in all the documents. Due to the large num¬ 
ber of documents in many collections, this measure is usually squashed with 
a log function leaving us with the following inverse document frequency docum|nt 

rntUUtNOY 

term weight (Sparck Jones, 1972). 

idfi = log(-) (17.20) 

Mi 

Combining the term frequency factor with this factor results in a scheme 
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Methodology Box: Evaluating Information Re¬ 
trieval Systems 

Information retrieval systems are evaluated with respect to the 
notion of relevance — a judgment by a human that a document is 
relevant to a query. A system’s ability to retrieve relevant documents 
is assessed with a recall measure, as in Chapter 15. 

Recall —_ #of relevant documents returned 

Lotal # of relevant documents in the collection 

Of course, a system can achieve 100% recall by simply return¬ 
ing all the documents in the collection. A system’s accuracy is based 
on how many of the documents returned for a given query arc actu¬ 
ally relevant, which can be assessed by a precision metric. 

Piwidnn — # of relevant documents returned 
# of documents returned 

These measures are complicated by the fact that most systems 
do not make explicit relevance judgments, but rather rank their col¬ 
lection with respect to a query. To deal with this we can specify a 
set of cutoffs in the output, anti measure average precision for the 
documents ranked above the cutoff. Alternatively, we can specify 
a set of recall levels anti measure average precision at those levels. 
This latter method gives rise to what are known as precision-recall 
curves as shown in Figure 17.5. As these curves show, comparing 
the performance of two systems can be difficult. In this comparison, 
one system is better at both high and low levels of recall, while the 
other is better in the middle region. An alternative to these curves 
are metrics that attempt to combine recall and precision into a single 
value. The F measure introduced on page 576 is one such measure. 

The U.S. government sponsored TREC (Text RElricval Confer¬ 
ence) evaluations have provided a rigorous testbed for die evalua¬ 
tion of a variety of information retrieval tasks and techniques. Like 
the MUC evaluations, TREC provides large document sets for both 
training and testing, along with a uniform scoring system, '[’rain¬ 
ing materials consist of sets of documents accompanied by sets of 
queries (called topics in TREC) and relevance judgments. Voorhees 
and Harman (1998) provides die details for the most recent meeting. 
Details of all of the meetings can be found at the TREC page on the 
National Institute of Standards and Technology Web site. 
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known as tf- idf weighting. 

= tfij x idfi (17.21) 

That is, the weight of term i in the vector for document j is the product of 
its overall frequency in j with the log of its inverse document frequency in 
the collection. With some minor variations, this weighting scheme is used to 
assign term weights to documents in nearly all vector space retrieval models. 

Despite the fact that we use the same representations for documents 
and queries, it is not at all clear that the same weighting scheme should be 
used for both. In many ad hoc retrieval settings such as Web search engines, 
user queries are not very much like documents at all. For example, an analy¬ 
sis of a very large set of queries (1,000,000,000 actually) from the AltaVista 
search engine reveals that the average query length is around 2.3 words (Sil¬ 
vers tein et al , 1998). In such an environment, the raw term frequency in the 
query is not likely to be a very useful factor. Instead, Salton and Buckley 
(1988) recommend the following formula for weighting query terms, where 
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Maxjtfjjc denotes the frequency of the most frequent term in document k. 



O-Stfjjc 
Ma Xjtfjf 


x idfi 


(17.22) 


Term Selection and Creation 

We have been assuming thus far that it is precisely the words that occur in 
a collection that will be used to index the documents in the collection. Two 
common variations on this assumption involve the use of stemming, and a 
stop list. 

stemming The notion of stemming takes us back to Chapter 3 and the topic mor¬ 

phological analysis. The basic question addressed by stemming is whether 
the morphological variants of a lexical item should be listed (and counted) 
separately, or whether they should be collapsed into a single root form. For 
example, without stemming, the terms process , processing and processed 
will be treated as distinct items with separate term frequencies in a term-by¬ 
document matrix; with stemming they will be conflated to the single term 
process with a single summed frequency count. The major advantage to us¬ 
ing stemming is that it allows a particular query term to match documents 
containing any of the morphological variants of the term. The Porter stem- 
mer (Porter, 1980) described Chapter 3 is the system most-used for this pur¬ 
pose retrieval from collections of English documents. 

A significant problem with this approach is that it throws away useful 
distinctions. For example, consider the use of the Porter stemmer on docu¬ 
ments and queries containing the words stocks and stockings. In this case, 
the Porter stemmer reduces these surface forms to the single term stock. Of 
course, the result of this is that queries concerning stock prices will return 
documents about stockings , and queries about stockings will find documents 
about stocks. 2 More technically, stemming may increase recall by find¬ 
ing documents with terms that are morphologically related to queries, but it 
may also reduce precision by returning semantically unrelated documents. 
For this reason, few Web search engines currently make use of stemming. 
Frakes and Baeza-Yates (1992) presents results from a series of experiments 
that explore the efficacy of stemming. 

A second common technique is the use of stop lists, which address 

2 This example is motivated by some bad publicity received by a well-known search engine, 
when it returned some rather salacious sites containing extensive use of the term stockings in 
response to queries concerning stock prices. In response, a spokesman announced that their 
engineers were working hard on a solution to this strange problem with words. 
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the issue of what words should be allowed into the index. A stop list is stop list 
a list of high frequency words that are eliminated from the representation 
of both documents and queries. Two motivations are normally given for 
this strategy: high frequency, closed-class, terms are seen as carrying little 
semantic weight and are thus unlikely to help with retrieval, and eliminating 
them can save considerable space in the inverted index files used to map from 
terms to the documents that contain them. The downside of using a stop list 
is that it makes it difficult to search for phrases that contain words in the 
stop list. For example, a common stop list derived from the Brown corpus 
presented in (Frakes and Baeza-Yates, 1992), would reduce the phrase to be 
or not to be to the phrase not. 

Homonymy, Polysemy and Synonymy 

Since the vector space model is based solely on the use of simple terms, its 
useful to consider the effect that various lexical semantic phenomena have on 
the model. Consider a query containing the word canine with its tooth and 
dog senses. A query containing canine will be judged similar to documents 
making use of either of these senses. However, given that users are probably 
only interested in one of these senses, the documents containing the other 
sense will be judged non-relevant. Homonymy and polysemy, therefore, 
have the effect of reducing precision by leading a system to return documents 
irrelevant to the users information need. 

Now consider a query consisting of the lexeme dog. This query will 
be judged close to documents that make frequent use of the term dog , but 
may fail to match documents that use close synonyms like canine , as well as 
documents that use hyponyms such as malamute. Synonymy and hyponymy, 
therefore, have the effect of reducing recall by causing the retrieval system 
to miss relevant documents. 

Note that it is inaccurate to state flatly that that polysemy reduces preci¬ 
sion, and synonymy reduces recall since, as we discussed on page 648, both 
measures are relative to a fixed cutoff. As a result, every non-relevant docu¬ 
ment that rises above the cutoff due to polysemy takes up a slot in the fixed 
size return set, and may thus push a relevant document below threshold thus 
reducing recall. Similarly, when a document is missed due to synonymy, 
a slot is opened in the return set for a non-relevant document, potentially 
reducing precision as well. 

Not surprisingly, these issues lead to the question of whether or not 
word sense disambiguation can help in information retrieval. The evidence 
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on this point is mixed, with some experiments reporting a sizable gain using 
disambiguation (Schutze and Pedersen, 1995), and others reporting either no 
gain, or a degradation in performance (Krovetz and Croft, 1992; Yoorhees, 
1998). 

Improving User Queries 

One of the most effective ways to improve retrieval performance is to find a 
way to improve user queries. The techniques presented in this section have 
been shown to varying degrees to be effective at this task. 

The single most effective way to improve retrieval performance in the 
feedback e vector space model is the use of relevance feedback (Rocchio, 1971). In 
this method, a user presents a query to the system and is presented with a 
small set of retrieved documents. The user is then asked to specify which 
of these documents appears relevant to their need. The user’s original query 
is then reformulated based on the distribution of terms in the relevant and 
non-relevant documents that the user examined. This reformulated query is 
then passed to the system as a new query with the new results being shown to 
the user. Typically an enormous improvement is seen after a single iteration 
of this technique. 

The formal basis for the implementation of this technique falls out di¬ 
rectly from some of the basic geometric intuitions of the vector model. In 
particular, we would like to push the vector representing the user’s origi¬ 
nal query toward the documents that have been found to be relevant, and 
away from the documents judged not relevant. This can be accomplished by 
adding an averaged vector representing the relevant documents to the origi¬ 
nal query, and subtracting an averaged vector representing the non-relevant 
queries. 

More formally, let’s assume that cp represents the user’s original query, 
R is the number of relevant documents returned from the original query, and 
N is the number of non-relevant documents. In addition, assume that (3 and y 
range from 0 to 1 and that (3 + y= 1. Given these assumptions, the following 
represents a standard relevance feedback update formula. 



The factors beta and y in this formula represent parameters that can 
be adjusted experimentally. Intuitively, they represent how far the original 
vector should be pushed towards the relevant documents or away from the 
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non-relevant ones. Salton and Buckley (1990) report good results with (3 = 

.75 and y= .25. 

We should note that evaluating systems that use relevance feedback is 
rather tricky. In particular, an enormous improvement is often seen in the 
documents retrieved by the first reformulated query. This should not be too 
surprising since it includes the documents that the user has told the system 
were relevant. The preferred way to avoid this inflation is to only compute 
recall and precision measures for what is called the residual collection, the collection 
original collection without any of the documents shown to the user on any 
previous round. This usually has the effect of driving the system’s raw per¬ 
formance below that achieved with the first query, since the most highly rele¬ 
vant documents have now been eliminated. Nevertheless, this is an effective 
technique to use when comparing distinct relevance feedback mechanisms. 

An alternative approach to query improvement focuses on the terms 
that comprise the query vector, rather than the query vector itself. In query 
expansion, the users original query is expanded to include terms related to expansion 
the original terms. This has typically been accomplished by adding adding 
terms chosen from lists of terms that are highly correlated with the user’s 
original terms in the collection. Such highly correlated terms are listed in 
what is typically called a thesaurus, although since it is based on correlation, thesaurus 
rather than synonymy, it is only loosely connected to the standard references 
that carry the same name. 

Unfortunately, it is usually the case that available thesaurus-like re¬ 
sources are not suitable for most collections. In thesaurus generation, a genIration 
correlation-based thesaurus is generated automatically from all or a portion 
of the documents in the collection. Not surprisingly, one of the most popular 
methods used in thesaurus generation involves the use of term clustering, clustering 
Recall, from our characterization of the term-by-document matrix that the 
columns in the matrix represent the documents and the rows represent the 
terms. Therefore, in thesaurus generation, the rows can be clustered to form 
sets of synonyms, which can then be added to the user’s original query to 
improve its recall. 

This technique is typically instantiated in one of two ways: a thesaurus 
can be generated once from the document collection as a whole (Crouch and 
Yang, 1992), or sets of synonym-like terms can be generated dynamically 
from the returned set for the original query (Attar and Fraenkel, 1977). Note 
that this second approach entails far more effort, since in effect a small the¬ 
saurus is generated for the documents returned for every query, rather than 
once for entire collection. 
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CATEGORIZA¬ 

TION 


FILTERING 

ROUTING 


DOCUMENT 

CLUSTERING 


Other Information Retrieval Tasks 


As noted earlier, ad-hoc retrieval is not the only word-based task in infor¬ 
mation retrieval. Some of the other more important ones include document 
categorization, document clustering, and text segmentation. 

The categorization task is to assign a new document to one of a pre¬ 
existing set of document classes. In this setting, the task of creating a clas¬ 
sifier consists of discovering a useful characterization of the documents that 
belong in each class. Although this can be done by hand, the principal way 
to approach this problem is to use supervised machine learning. In particu¬ 
lar, classifiers can be trained on a set of documents that have been labeled 
with the correct class. Not surprisingly, all the supervised learning methods 
introduced on page 634 for word sense disambiguation can be applied to this 
task as well. 

When categorization is performed with the intent of then transmitting 
the document to a user or set of interested users it is usually referred to as 
filtering or routing. An interesting example of this is AT&T’s ’How May 
I Help You’ task where the goal is to classify a user’s utterance into one 
of fifteen possible categories, such as third number billing, or collect call. 
Once the system has classified the call, the system routes the caller to an 
appropriate human operator. This task provides a good example of the need 
for in vivo evaluation mentioned earlier. The classification accuracy on this 
task approaches 80 %, despite the fact that the speech recognizer has a word 
accuracy rate of only around 50 % (Gorin et al , 1997). 

The categorization task assumes an existing classification, or cluster¬ 
ing, of documents. By contrast, the task of document clustering is to create, 
or discover, a reasonable set of clusters for a given set of documents. As was 
the case word sense discovery, a reasonable cluster is defined as one that 
maximizes the within-cluster document similarity, and minimizes between- 
cluster similarity. There are two principal motivations for the use of this 
technique in an ad hoc retrieval setting: efficiency, and the cluster hypothe¬ 
sis. 

The efficiency motivation arises from the enormous size of many mod¬ 
em document collections. Recall that the retrieval method described in the 
last section requires every query to be compared against every document in 
the collection. If a collection can be divided up into a set of N conceptually 
coherent clusters, then queries could first be compared against representa¬ 
tions of each of the N clusters. Ordinary retrieval could then be applied only 
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within the top cluster or clusters, thus saving the cost of comparing the query 
to the documents in all of the other more distant clusters. 

The cluster hypothesis (Jardine and van Rijsbergen, 1971) takes this hypothesis 
argument a step further by asserting that retrieval from a clustered collection 
will not only be more efficient, but will in fact improve retrieval performance 
in terms of recall and precision. The basic notion behind this hypothesis is 
that by separating documents according to topic, relevant documents will 
be found together in the same cluster, and non-relevant documents will be 
avoided since they will be reside in clusters that are not used for retrieval. 

Despite the plausibility of this hypothesis, there is only mixed experimental 
support for it. Results vary considerably based on the clustering algorithm 
and document collection in use (Willett, 1988; Shaw et al , 1996). 

Finally, in text segmentation, larger documents are automatically bro- Staton 
ken down into smaller semantically coherent chunks. This is useful in do¬ 
mains where there are a significant number of large documents that cover 
a wide variety of topics. Text segmentation can be used to either perform 
retrieval below the document level, or to visually guide the user to relevant 
parts of retrieved documents. Again, not surprisingly, segmentation algo¬ 
rithms often make use of vector-like representations for the subparts of a 
larger document. Adjacent subparts that have similar cosines are more likely 
to about the same topic than adjacent segments with more distant cosines. 

Roughly speaking, such discontinuities in the similarity between adjacent 
text segments can be used to divide larger documents into subparts (Salton 
etal, 1993; Hearst, 1997). 


17.5 Summary 

This chapter has explored two major areas of lexical semantic processing: 
word sense disambiguation and information retrieval. 

• Word sense disambiguation systems assign word tokens in context to 
one of a pre-specified set of senses. 

• Selection restriction-based approaches can be used to disambiguate 
both predicates and arguments. 

• Selection restriction-based methods require considerable information 
about semantic roles restrictions and hierarchical type information about 
role fillers. 
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• Machine learning approaches to sense disambiguation make it possible 
to automatically create robust sense disambiguation systems. 

• Supervised approaches use collections of texts annotated with their 
correct senses to train classifiers. 

• Bootstrapping approaches permit the use of supervised methods with 
far fewer resources. 

• Unsupervised, clustering-based, approaches attempt to discover repre¬ 
sentations of word senses from unannotated texts. 

• Machine readable dictionaries facilitate the creation of broad-coverage 
sense disambiguators. 

• The dominant models of information retrieval represent the meanings 
of documents and queries as bags of words. 

• The vector space model views documents and queries as vectors in a 
large multidimensional space. 

• The similarity between documents and queries, or other documents, 
can be measured by the cosine of the angle between the vectors. 

• The values of the features of vectors is based on a combination of the 
frequency of terms within a document and the distribution of terms 
across the document. 

• Polysemy and synonymy wreak havoc with word-based information 
retrieval systems, reducing both precision and recall. 

• User queries can be improved through query reformulation using either 
relevance feedback or thesaurus-based query expansion. 


Bibliographical and Historical Notes 

Word sense disambiguation traces its roots to some of the earliest applica¬ 
tions of digital computers. The notion of disambiguating a word by looking 
at small window around it was apparently first suggested by Warren Weaver 
(1955b), in the context of machine translation. Among the notions first pro¬ 
posed in this early period were the use of a thesaurus for disambiguation 
(Masterman, 1957), supervised training of Bayesian models for disambigua¬ 
tion (Madhu and Lytel, 1965), and the use of clustering in word sense anal¬ 
ysis (Sparck Jones, 1986). 

An enormous amount of work on disambiguation has been conducted 
within the context of Al-oriented natural language processing systems. It is 
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fair to say that most natural language analysis systems of this type exhibit 
some form of lexical disambiguation capability. However, a number of these 
efforts made word sense disambiguation a larger focus of their work. Among 
the most influential efforts were the efforts of Quillian (1968) and Simmons 
(1973b) with semantic networks, the work of Wilks with Preference Seman¬ 
tics (Wilks, 1975c, 1975b, 1975a)ks75, and the work of Small and Rieger 
(1982) and Riesbeck (1975) on word-based understanding systems. Hirst’s 
absity system (Hirst and Chamiak, 1982; Hirst, 1986, 1988), which used 
a technique based on semantic networks called marker passing, represents 
the most advanced system of this type. As with these largely symbolic ap¬ 
proaches, most connectionist approaches to word sense disambiguation have 
relied on small lexicons with hand-coded representations (Cottrell, 1985; 
Kawamoto, 1988). 

We should note that considerable work on sense disambiguation has 
been conducted in the areas of Cognitive Science and psycholinguistics. Ap¬ 
propriately enough, it is generally described using a different name: lexical 
ambiguity resolution. Small et al (1988) present a variety of papers from 
this perspective. 

The earliest implementation of a robust empirical approach to sense 
disambiguation is due to Kelly and Stone (1975) who directed a team of 
that hand-crafted a set of disambiguation rules for 1790 ambiguous English 
words. Lesk (1986) was the first to use a machine readable dictionary for 
word sense disambiguation. The efforts at New Mexico State University 
using LDOCE are among the most extensive explorations of the use of ma¬ 
chine readable dictionaries. Much of this work is described in (Wilks et al , 
1996). The problem of dictionary senses being too fine-grained or lacking 
an appropriate organization has been addressed in the work of (Dolan, 1994) 
and (Chen and Chang, 1998). 

Modem interest in supervised machine learning approaches to disam¬ 
biguation began with Black (1988), who applied decision tree learning to the 
task. The need for large amounts of annotated text in these methods led to in¬ 
vestigations into the use of bootstrapping methods (Hearst, 1991; Yarowsky, 
1995). The problem of how to weight and combine the disparate sources of 
evidence used in many robust systems is explored in (Ng and Lee, 1996) and 
(McRoy, 1992). There has been considerably less work in the area of unsu¬ 
pervised methods. The earliest attempt attempt to use clustering in the study 
of word senses is due to (Sparck Jones, 1986). Zemik (1991) successfully 
applied a standard information retrieval clustering algorithm to the problem, 
and provided an evaluation based on improvements in retrieval performance. 



658 


Chapter 17. Word Sense Disambiguation and Information Retrieval 


PROBABILIS¬ 

TIC 

MODEL 


More extensive recent work on clustering can be found in (Pedersen and 
Bruce, 1997; Schiitze, 1997, 1998). 

Note that of all of these robust efforts, only three have attempted to ex¬ 
ploit the power of mutually disambiguating all the words in a sentence. The 
system described in (Kelly and Stone, 1975) makes multiple passes over a 
sentence to take later advantage of easily disambiguated words; Cowie et al 
(1992) use a simulated annealing model to perform a parallel search for a 
desirable set of senses; Yeronis and Ide (1990) use inhibition and excita¬ 
tion in a neural network automatically constructed from a machine readable 
dictionary. 

Ide and Yeronis (1998) provide a comprehensive review of the history 
and current state of word sense disambiguation. (Ng and Zelle, 1997) pro¬ 
vide a more focused review from a machine learning perspective. Wilks 
et al (1996) describe a wide array of dictionary and corpus-based experi¬ 
ments, along with detailed descriptions of some very early work. 

Luhn (1957) is generally credited with first advancing the notion of 
fully automatic indexing of documents based on their contents. Over the 
years Salton’s SMART project (Salton, 1971) at Cornell developed or eval¬ 
uated many of the most important notions in information retrieval including 
the vector model, term weighting schemes, relevance feedback, and the use 
of cosine as a similarity metric. The notion of using inverse document fre¬ 
quency in term weighting is due to (Sparck Jones, 1972). The original notion 
of relevance feedback is due to (Rocchio, 1971). An alternative to the vec¬ 
tor model that we have not covered is the probabilistic model. Originally 
shown effective by Robinson and Sparck Jones (1976), a Bayesian network 
version of the probabilistic model is the basis for the widely used inquery 
system (Callan et al , 1992). 

The cluster hypothesis was introduced in (Jardine and van Rijsber- 
gen, 1971). Willett (1988) provides a critical review of the major efforts 
in this area. Mather (1998) presents an algorithm-independent clustering 
metric that can be used to evaluate the performance of various clustering al¬ 
gorithms. A collection of papers on document categorization and its close 
siblings, filtering and routing, can be found in (Lewis and Hayes, 1994). Text 
segmentation has generally been investigated from one of two perspectives: 
approaches based on strong theories of discourse structure, and approaches 
based on lexical text cohesion (Morris and Hirst, 1991). Hearst (1997) de¬ 
scribes a robust technique based on a vector model of lexical cohesion. Tech¬ 
niques based on strong discourse-models are discussed in Chapter 18 and 
Chapter 20. 



Section 17.5. Summary 


659 


An important extension of the vector space model known as Latent 
Semantic Indexing (LSI) (Deerwester el al , 1990) uses the singular value semantic 
decomposition method as means of reducing the dimensionality of vector 
models with the intent of discovering higher-order regularities in the original 
term-by-document matrix. Although LSI began life as a retrieval method, it 
has been applied to a wide variety of applications including models of lexical 
acquisition (Landauer andDumais, 1997), question answering (Jones, 1997), 
and most recently, student essay grading (Landauer et al , 1997). 

Baeza-Yates and Ribeiro-Neto (1999) is a comprehensive text cover¬ 
ing many of newest advances and trends in information retrieval. Frakes 
and Baeza-Yates (1992) is a more nuts and bolts text which includes a con¬ 
siderable amount of useful C code. Older classic texts include (Salton and 
McGill, 1983) and (van Rijsbergen, 1975). (Sparck Jones and Willett, 1997) 
includes many of the classic papers in the field. Current work is often pub¬ 
lished in the annual proceedings of the ACM Special Interest Group on In¬ 
formation Retrieval (SIGIR). The periodic TREC conference proceedings 
contain results from standardized evaluations organized by the U.S. govern¬ 
ment. The primary journals in the field are the Journal of the American 
Society of Information Sciences , ACM Transactions on Information Systems , 
Information Processing and Management , and Information Retrieval. 


Exercises 




Pragmatics is the study of (some parts of) the relation between 
language and context-of-use. Context-of-use includes such things as 
the identities of people and objects, and so pragmatics includes stud¬ 
ies of how language is used to refer (and re-refer) to people and things. 
Context-of-use includes the discourse context, and so pragmatics in¬ 
cludes studies of how discourses are structured, and how the listener 
manages to interpret a conversational partner in a conversation. This 
section explores algorithms for reference resolution, computational 
models for recovering the structure of monologue and conversational 
discourse, and models of how utterances in dialog are interpreted. 
This section also discusses the role of each of these models in build¬ 
ing a conversational agent, as well as the design of the dialog manager 
component of such an agent. Finally, the section introduces natural 
language generation, focusing especially on the function of discourse. 





DISCOURSE* 



Gracie: Oh yeah... And then Mr. and Mrs. Jones were having 
matrimonial trouble, and my brother was hired to watch Mrs. 
Jones. 

George: Well, I imagine she was a very attractive woman. 
Gracie: She was, and my brother watched her day and night for 
six months. 

George: Well, what happened? 

Gracie: She finally got a divorce. 

George: Mrs. Jones? 

Gracie: No, my brother’s wife. 

George Bums and Gracie Allen in The Salesgirl 


Up to this point of the book, we have focused primarily on language 
phenomena that operate at the word or sentence level. Of course, language 
does not normally consist of isolated, unrelated sentences, but instead of 
collocated, related groups of sentences. We refer to such a group of sentences 

as a discourse. 

The chapter you are now reading is an example of a discourse. It is in 
fact a discourse of a particular sort: a monologue. Monologues are charac¬ 
terized by a speaker (a term which will be used to include writers, as it is 
here), and a hearer (which, analogously, includes readers). The communi¬ 
cation flows in only one direction in a monologue, that is, from the speaker 
to the hearer. 

After reading this chapter, you may have a conversation with a friend 
about it, which would consist of a much freer interchange. Such a discourse 
is called a dialogue. In this case, each participant periodically takes turns 

*This chapter by Andrew Kehler 


DISCOURSE 

MONOLOGUE 

DIALOGUE 
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being a speaker and hearer. Unlike a typical monologue, dialogues generally 
consist of many different types of communicative acts: asking questions, 
giving answers, making corrections, and so forth. 

Finally, computer systems exist and continue to be developed that al- 
hci low for human-computer interaction , or HCI. HCI has properties that distin¬ 

guish it from normal human-human dialogue, in part due to the present-day 
limitations on the ability of computer systems to participate in free, uncon¬ 
strained conversation. A system capable of HCI will often employ a strategy 
to constrain the conversation in ways that allow it to understand the user’s 
utterances within a limited context of interpretation. 

While many discourse processing problems are common to these three 
forms of discourse, they differ in enough respects that different techniques 
have often been used to process them. This chapter focuses on techniques 
commonly applied to the interpretation of monologues; techniques for dia¬ 
logue interpretation and HCI will be described in Chapter 19. 

Language is rife with phenomena that operate at the discourse level. 
Consider the discourse shown in example (18.1). 

(18.1) John went to Bill’s car dealership to check out an Acura Integra. He 
looked at it for about an hour. 

What do pronouns such as he and it denote? No doubt that the reader had 
little trouble figuring out that he denotes John and not Bill, and that it denotes 
the Integra and not Bill’s car dealership. On the other hand, toward the end 
of the exchange presented at the beginning of this chapter, it appears that 
George had some trouble figuring out who Gracie meant when saying she. 

What differentiates these two examples? How do hearers interpret dis¬ 
course (18.1) with such ease? Can we build a computational model of this 
process? These are the types of questions we address in this chapter. In Sec¬ 
tion 18.1, we describe methods for interpreting referring expressions such as 
pronouns. We then address the problem of establishing the coherence of a 
discourse in Section 18.2. Finally, in Section 18.3 we explain methods for 
determining the structure of a discourse. 

Because discourse-level phenomena are ubiquitous in language, algo¬ 
rithms for resolving them are essential for a wide range of language appli¬ 
cations. For instance, interactions with query interfaces and dialogue inter¬ 
pretation systems like ATIS (see Chapter 9) frequently contain pronouns and 
similar types of expressions. So when a user spoke passage (18.2) to an ATIS 
system, 
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(18.2) I’d like to get from Boston to San Francisco, on either December 5th 
or December 6th. It’s okay if it stops in another city along the way. 

the system had to figure out that it denotes the flight that the user wants to 
book in order to perform the appropriate action. 

Similarly, information extraction systems (see Chapter 15) must fre¬ 
quently extract information from utterances that contain pronouns. For in¬ 
stance, if an information extraction system is confronted with passage (18.3), 

(18.3) First Union Corp is continuing to wrestle with severe problems 
unleashed by a botched merger and a troubled business strategy. 

According to industry insiders at Paine Webber, their president, John 
R. Georgius, is planning to retire by the end of the year. 

it must correctly identify First Union Corp as the denotation of their (as 
opposed to Paine Webber , for instance) in order to extract the correct event. 

Likewise, many text summarization systems employ a procedure for 
selecting the important sentences from a source document and using them 
to form a summary. Consider, for example, a news article that contains pas¬ 
sage (18.3). Such a system might determine that the second sentence is 
important enough to be included in the summary, but not the first. How¬ 
ever, the second sentence contains a pronoun that is dependent on the first 
sentence, so it cannot place the second sentence in the summary without first 
determining the pronoun’s denotation, as the pronoun would otherwise likely 
receive a different interpretation within the summary. Similarly, natural lan¬ 
guage generation systems (see Chapter 20) must have adequate models for 
pronominalization to produce coherent and interpretable discourse. In short, 
just about any conceivable language processing application requires methods 
for determining the denotations of pronouns and related expressions. 

18.1 Reference Resolution 

In this section we study the problem of reference, the process by which reference 
speakers use expressions like John and he in passage (18.1) to denote a per¬ 
son named John. Our discussion requires that we first define some termi¬ 
nology. A natural language expression used to perform reference is called a 
referring expression, and the entity that is referred to is called the referent, expression 
Thus, John and he in passage (18.1) are referring expressions, and John is referent 
their referent. (To distinguish between referring expressions and their refer¬ 
ents, we italicize the former.) As a convenient shorthand, we will sometimes 
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speak of a referring expression referring to a referent, e.g., we might say 
that he refers to John. However, the reader should keep in mind that what 
we really mean is that the speaker is performing the act of referring to John 
by uttering he. Two referring expressions that are used to refer to the same 
entity are said to corefer, thus John and he corefer in passage (18.1). There 
is also a term for a referring expression that licenses the use of another, in 
the way that the mention of John allows John to be subsequently referred to 
using he. We call John the antecedent of he. Reference to an entity that 
has been previously introduced into the discourse is called anaphora, and 
the referring expression used is said to be anaphoric. In passage (18.1), the 
pronouns he and it are therefore anaphoric. 

Natural languages provide speakers with a variety of ways to refer to 
entities. Say that your friend has an Acura Integra automobile and you want 
to refer to it. Depending on the operative discourse context, you might 
say it, this, that, this car, that car, the car , the Acura , the Integra , or my 
friend's car , among many other possibilities. However, you are not free to 
choose between any of these alternatives in any context. For instance, you 
cannot simply say it or the Acura if the hearer has no prior knowledge of your 
friend’s car, it has not been mentioned before, and it is not in the immediate 
surroundings of the discourse participants (i.e., the situational context of 
the discourse). 

The reason for this is that each type of referring expression encodes dif¬ 
ferent signals about the place that the speaker believes the referent occupies 
within the hearer’s set of beliefs. A subset of these beliefs that has a spe¬ 
cial status form the hearer’s mental model of the ongoing discourse, which 
we call a discourse model (Webber, 1978). The discourse model contains 
representations of the entities that have been referred to in the discourse and 
the relationships in which they participate. Thus, there are two components 
required by a system to successfully produce and interpret referring expres¬ 
sions: a method for constructing a discourse model that evolves with the 
dynamically-changing discourse it represents, and a method for mapping be¬ 
tween the signals that various referring expressions encode and the hearer’s 
set of beliefs, the latter of which includes this discourse model. 

We will speak in terms of two fundamental operations to the discourse 
model. When a referent is first mentioned in a discourse, we say that a rep¬ 
resentation for it is evoked into the model. Upon subsequent mention, this 
representation is accessed from the model. The operations and relationships 
are illustrated in Figure 18.1. 

We will restrict our discussion to reference to entities, although dis- 
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a new Acura Integra it 

corefer 


Figure 18.1 Reference operations and relationships. THIS PICTURE TO 

BE REPLACED. 


courses include reference to many other types of referents. Consider the 
possibilities in example (18.4), adapted from Webber (1991). 

(18.4) According to John, Bob bought Sue an Integra, and Sue bought Fred 
a Legend. 

a. But that turned out to be a lie. 

b. But that was false. 

c. That struck me as a funny way to describe the situation. 

d. That caused Sue to become rather poor. 

e. That caused them both to become rather poor. 

The referent of that is a speech act (see Chapter 19) in (18.4a), a proposition 
in (18.4b), a manner of description in (18.4c), an event in (18.4d), and a 
combination of several events in (18.4e). The field awaits the development 
of robust methods for interpreting these types of reference. 

Reference Phenomena 

The set of referential phenomena that natural languages provide is quite rich 
indeed. In this section, we provide a brief description of several basic ref¬ 
erence phenomena. We first survey five types of referring expression: in¬ 
definite noun phrases , definite noun phrases , pronouns , demonstratives , and 
one-anaphora . We then describe three types of referents that complicate the 
reference resolution problem: infer rabies, discontinuous sets , and generics. 

Indefinite Noun Phrases Indefinite reference introduces entities that are 
new to the hearer into the discourse context. The most common form of 
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indefinite reference is marked with the determiner a (or an), as in (18.5), 
but it can also be marked by a quantifier such as some (18.6) or even the 
determiner this (18.7). 

(18.5) I saw an Acura Integra today. 

(18.6) Some Acura Integras were being unloaded at the local dealership 
today. 

(18.7) I saw this awesome Acura Integra today. 

Such noun phrases evoke a representation for a new entity that satisfies the 
given description into the discourse model. 

The indefinite determiner a does not indicate whether the entity is iden¬ 
tifiable to the speaker, which in some cases leads to a specific!non-specific 
ambiguity. Example (18.5) only has the specific reading, since the speaker 
has a particular Integra in mind, particularly the one she saw. In sentence 

(18.8) , on the other hand, both readings are possible. 

(18.8) lam going to the dealership to buy an Acura Integra today. 

That is, the speaker may already have the Integra picked out (specific), or 
may just be planning to pick one out that is to her liking (nonspecific). The 
readings may be disambiguated by a subsequent referring expression in some 
contexts; if this expression is definite then the reading is specific (/ hope 
they still have it), and if it is indefinite then the reading is nonspecific (/ 
hope they have a car I like). This rule has exceptions, however; for instance 
definite expressions in certain modal contexts (/ will park it in my garage) 
are compatible with the nonspecific reading. 

Definite Noun Phrases Definite reference is used to refer to an entity that 
is identifiable to the hearer, either because it has already been mentioned in 
the discourse context (and thus is represented in the discourse model), it is 
contained in the hearer’s set of beliefs about the world, or the uniqueness of 
the object is implied by the description itself. 

The case in which the referent is identifiable from discourse context is 
shown in (18.9). 

(18.9) I saw an Acura Integra today. The Integra was white and needed to 
be washed. 

Examples in which the referent is either identifiable from the hearer’s 
set of beliefs or is inherently unique are shown in (18.10) and (18.11) re¬ 
spectively. 

(18.10) The Indianapolis 500 is the most popular car race in the US. 
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(18.11) The fastest car in the Indianapolis 500 was an Integra. 

Definite noun phrase reference requires that an entity be accessed from either 
the discourse model or the hearer’s set of beliefs about the world. In the latter 
case, it also evokes a representation of the referent into the discourse model. 

Pronouns Another form of definite reference is pronominalization, illus¬ 
trated in example (18.12). 

(18.12) I saw an Acura Integra today. It was white and needed to be 
washed. 

The constraints on using pronominal reference are stronger than for full defi¬ 
nite noun phrases, requiring that the referent have a high degree of activation 
or salience in the discourse model. Pronouns usually (but not always) refer salience 
to entities that were introduced no further than one or two sentences back in 
the ongoing discourse, whereas definite noun phrases can often refer further 
back. This is illustrated by the difference between sentences (18.13d) and 
(18.13d’). 

(18.13) a. John went to Bob’s party, and parked next to a beautiful Acura 
Integra. 

b. He went inside and talked to Bob for more than an hour. 

c. Bob told him that he recently got engaged. 

d. ?? He also said that he bought it yesterday. 

d.’ He also said that he bought the Acura yesterday. 

By the time the last sentence is reached, the Integra no longer has the degree 
of salience required to allow for pronominal reference to it. 

Pronouns can also participate in cataphora, in which they are men- cataphora 
tioned before their referents are, as in example (18.14). 

(18.14) Before he bought it, John checked over the Integra very carefully. 

Here, the pronouns he and it both occur before their referents are introduced. 

Pronouns also appear in quantified contexts in which they are consid¬ 
ered to be bound, as in example (18.15). bound 

(18.15) Every woman bought her Acura at the local dealership. 

Under the relevant reading, her does not refer to some woman in context, 
but instead behaves like a variable bound to the quantified expression every 
woman. We will not be concerned with the bound interpretation of pronouns 
in this chapter. 
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Demonstratives Demonstrative pronouns, like this and that , behave some¬ 
what differently that simple definite pronouns like it They can appear either 
alone or as determiners, for instance, this Acura , that Acura. The choice be¬ 
tween two demonstratives is generally associated with some notion of spa¬ 
tial proximity: this indicating closeness and that signaling distance. Spatial 
distance might be measured with respect to the discourse participants’ situa¬ 
tional context, as in (18.16). 

(18.16) [John shows Bob an Acura Integra and a Mazda Miata] 

Bob (pointing): I like this better than that. 

Alternatively, distance can be metaphorically interpreted in terms of con¬ 
ceptual relations in the discourse model. For instance, consider example 

(18.17) . 

(18.17) I bought an Integra yesterday. It’s similar to the one I bought five 
years ago. That one was really nice, but I like this one even better. 

Here, that one refers to the Acura bought five years ago (greater temporal 
distance), whereas this one refers to the one bought yesterday (closer tem¬ 
poral distance). 

One Anaphora One- anaphora, exemplified in (18.18), blends properties 
of definite and indefinite reference. 

(18.18) I saw no less than 6 Acura Integras today. Now I want one. 

This use of one can be roughly paraphrased by one of them, in which 
them refers to a plural referent (or generic one, as in the case of (18.18), see 
below), and one selects a member from this set. Thus, one may evoke a new 
entity into the discourse model, but it is necessarily dependent on an existing 
referent for the description of this new entity. 

This use of one should be distinguished from the formal, non-specific 
pronoun usage in (18.19), and its meaning as the number one in (18.20). 

(18.19) One shouldn’t pay more than twenty thousand dollars for an Acura. 

(18.20) John has two Acuras, but I only have one. 

Inferrables Now that we have described several types of referring expres¬ 
sions, we now turn our attention to a few interesting types of referents that 
complicate the reference resolution problem. For instance, in some cases a 
referring expression does not refer to an entity that has been explicitly evoked 
in the text, but instead one that is inferentially related to an evoked entity. 
Such referents are called inferrables (Haviland and Clark, 1974; Prince, 
1981). Consider the expressions a door and the engine in sentence (18.21). 
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(18.21) I almost bought an Acura Integra today, but a door had a dent and 
the engine seemed noisy. 

The indefinite noun phrase a door would normally introduce a new door into 
the discourse context, but in this case the hearer is to infer something more: 
that it is not just any door, but one of the doors of the Integra. Similarly, the 
use of the definite noun phrase the engine normally presumes that an engine 
has been previously evoked or is otherwise uniquely identifiable. Here, no 
engine has been explicitly mentioned, but the hearer infers that the referent 
is the engine of the previously mentioned Integra. 

Inferrables can also specify the results of processes described by ut¬ 
terances in a discourse. Consider the possible follow-ons (a-c) to sentence 

(18.22) in the following recipe (from Webber and Baldwin (1992)): 

(18.22) Mix the flour, butter, and water. 

a. Kneed the dough until smooth and shiny. 

b. Spread the paste over the blueberries. 

c. Stir the batter until all lumps are gone. 

Any of the expressions the dough (a solid), the batter (a liquid), and the 
paste (somewhere in between) can be used to refer to the result of the actions 
described in the first sentence, but all imply different properties of this result. 

Discontinuous Sets In some cases, references using plural referring ex¬ 
pressions like they and them (see page 672) refer to sets of entities that are 
evoked together, for instance, using another plural expression (their Acuras ) 
or a conjoined noun phrase (John and Mary): 

(18.23) John and Mary love their Acuras. They drive them all the time. 

However, plural references may also refer to sets of entities that have 
been evoked by discontinuous phrases in the text: 

(18.24) John has an Acura, and Mary has a Mazda. They drive them all the 
time. 

Here, they refers to John and Mary, and likewise them refers to the Acura 
and the Mazda. Note also that the second sentence in this case will gener¬ 
ally receive what is called a pairwise or respectively reading, in which John 
drives the Acura and Mary drives the Mazda, as opposed to the reading in 
which they both drive both cars. 

Generics Making the reference problem even more complicated is the ex¬ 
istence of generic reference. Consider example (18.25). 
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(18.25) I saw no less than 6 Acura Integras today. They are the coolest cars. 

Here, the most natural reading is not the one in which they refers to the 
particular 6 Integras mentioned in the first sentence, but instead to the class 
of Integras in general. 

Syntactic and Semantic Constraints on Coreference 

Having described a variety of reference phenomena that are found in natu¬ 
ral language, we can now consider how one might develop algorithms for 
identifying the referents of referential expressions. One step that needs to be 
taken in any successful reference resolution algorithm is to filter the set of 
possible referents on the basis of certain relatively hard-and-fast constraints. 
We describe some of these constraints here. 

Number Agreement Referring expressions and their referents must agree 
in number; for English, this means distinguishing between singular and plu¬ 
ral references. A categorization of pronouns with respect to number is shown 
in Figure 18.2. 


Singular 

Plural 

Unspecified 

she, her, he, him, his, it 

we, us, they, them 

you 

Figure 18.2 Number agreement in the English pronominal system. 


The following examples illustrate constraints on number agreement. 

(18.26) John has a new Acura. It is red. 

(18.27) John has three new Acuras. They are red. 

(18.28) * John has a new Acura. They are red. 

(18.29) * John has three new Acuras. It is red. 

Person and Case Agreement English distinguishes between three forms 
of person: first, second, and third. A categorization of pronoun types with 
respect to person is shown in Figure 18.3. 

The following examples illustrate constraints on person agreement. 

(18.30) You and I have Acuras. We love them. 

(18.31) John and Mary have Acuras. They love them. 

(18.32) * John and Mary have Acuras. We love them, (where We =John and 
Mary) 

(18.33) * You and I have Acuras. They love them, (where They=You and I) 
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First 

Second 

Third 

Nominative 

I, we 

you 

he, she, they 

Accusative 

me, us 

you 

him, her, them 

Genitive 

my, our 

your 

his, her, their 

Figure 18.3 Person and case agreement in the English pronominal system/ 


In addition, English pronouns are constrained by case agreement; dif¬ 
ferent forms of the pronoun may be required when placed in subject position 
(nominative case, e.g., he, she, they), object position (accusative case, e.g., 
him, her, them), and genitive position (genitive case, e.g., his Acura, her 
Acura, their Acura). This categorization is also shown in Figure 18.3. 

Gender Agreement Referents also must agree with the gender specified 
by the referring expression. English third person pronouns distinguish be¬ 
tween male, female, and nonpersonal genders, and unlike many languages, 
the first two only apply to animate entities. Some examples are shown in 
Figure 18.4. 


masculine 

feminine 

nonpersonal 

he, him, his 

she, her 

it 

Figure 18.4 Gender agreement in the English pronominal system. 


The following examples illustrate constraints on gender agreement. 

(18.34) John has an Acura. He is attractive. (he=John, not the Acura) 

(18.35) John has an Acura. It is attractive. (it=thc Acura, not John) 

Syntactic Constraints Reference relations may also be constrained by the 
syntactic relationships between a referential expression and a possible an¬ 
tecedent noun phrase when both occur in the same sentence. For instance, 
the pronouns in all of the following sentences are subject to the constraints 
indicated in brackets. 

(18.36) John bought himself a new Acura. [himself=John] 

(18.37) John bought him a new Acura. [him^John] 

(18.38) John said that Bill bought him a new Acura. [him^Bill] 

(18.39) John said that Bill bought himself a new Acura. [himself=Bill] 

(18.40) He said that he bought John a new Acura. [He^John;he^John] 

English pronouns such as himself, herself, and themselves are called 
reflexives. Oversimplifying the situation considerably, a reflexive corefers reflexives 
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with the subject of the most immediate clause that contains it (ex. 18.36), 
whereas a nonreflexive cannot corefer with this subject (ex. 18.37). That 
this rule applies only for the subject of the most immediate clause is shown 
by examples (18.38) and (18.39), in which the opposite reference pattern is 
manifest between the pronoun and the subject of the higher sentence. On the 
other hand, a full noun phrase like John cannot corefer with the subject of 
the most immediate clause nor with a higher-level subject (ex. 18.40). 

Whereas these syntactic constraints apply to a referring expression 
and a particular potential antecedent noun phrase, these constraints actually 
prohibit coreference between the two regardless of any other available an¬ 
tecedents that denote the same entity. For instance, normally a nonreflexive 
pronoun like him can corefer with the subject of the previous sentence as 
it does in example (18.41), but it cannot in example (18.42) because of the 
existence of the coreferential pronoun he in the second clause. 

(18.41) John wanted a new car. Bill bought him a new Acura. [him=John] 

(18.42) John wanted a new car. He bought him a new Acura. 
[he=John;him^John] 

The rules given above oversimplify the situation in a number of ways, 
and there are many cases that they do not cover. Indeed, upon further inspec¬ 
tion the facts actually get quite complicated. In fact, it is unlikely that all of 
the data can be explained using only syntactic relations (Kuno, 1987). For 
instance, the reflexive himself and the nonreflexive him in sentences (18.43) 
and (18.44) respectively can both refer to the subject John, even though they 
occur in identical syntactic configurations. 

(18.43) John set the pamphlets about Acuras next to himself. 

[himself=John] 

(18.44) John set the pamphlets about Acuras next to him. [him=John] 

For the algorithms discussed later in this chapter, however, we will assume a 
syntactic account of restrictions on intrasentential coreference. 

Selectional Restrictions The selectional restrictions that a verb places on 
its arguments (see Chapter 16) may be responsible for eliminating referents, 
as in example (18.45). 

(18.45) John parked his Acura in the garage. He had driven it around for 
hours. 

There are two possible referents for it, the Acura and the garage. The verb 
drive, however, requires that its direct object denote something that can be 
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driven, such as a car, truck, or bus, but not a garage. Thus, the fact that the 
pronoun appears as the object of drive restricts the set of possible referents 
to the Acura. It is conceivable that a practical NLP system would include a 
reasonably comprehensive set of selectional constraints for the verbs in its 
lexicon. 

Selectional restrictions can be violated in the case of metaphor (see 
Chapter 16); for example, consider example (18.46). 

(18.46) John bought a new Acura. It drinks gasoline like you would not 
believe. 

While the verb drink does not usually take an inanimate subject, its metaphor¬ 
ical use here allows it to refer to a new Acura. 

Of course, there are more general semantic constraints that may come 
into play, but these are much more difficult to encode in a comprehensive 
manner. Consider passage (18.47). 

(18.47) John parked his Acura in the garage. It is incredibly messy, with 
old bike and car parts lying around everywhere. 

Here the referent of it is almost certainly the garage, but only because a car 
is probably too small to have bike and car parts laying around ‘everywhere’. 
Resolving this reference requires that a system have knowledge about how 
large cars typically are, how large garages typically are, and the typical types 
of objects one might find in each. On the other hand, one’s knowledge about 
Beverly Hills might lead one to assume that the Acura is indeed the referent 
of it in passage (18.48). 

(18.48) John parked his Acura in downtown Beverly Hills. It is incredibly 
messy, with old bike and car parts lying around everywhere. 

In the end, just about any knowledge shared by the discourse participants 
might be necessary to resolve a pronoun reference. However, due in part to 
the vastness of such knowledge, practical algorithms typically do not rely on 
it heavily. 

Preferences in Pronoun Interpretation 

In the previous section, we discussed relatively strict constraints that algo¬ 
rithms should apply when determining possible referents for referring ex¬ 
pressions. We now discuss some more readily violated preferences that al¬ 
gorithms can be made to account for. These preferences have been posited to 
apply to pronoun interpretation in particular. Since the majority of work on 
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reference resolution algorithms has focused on pronoun interpretation, we 
will similarly focus on this problem in the remainder of this section. 

Recency Most theories of reference incorporate the notion that entities in¬ 
troduced in recent utterances are more salient than those introduced from 
utterances further back. Thus, in example (18.49), the pronoun it is more 
likely to refer to the Legend than the Integra. 

(18.49) John has an Integra. Bill has a Legend. Mary likes to drive it. 

Grammatical Role Many theories specify a salience hierarchy of entities 
that is ordered by the grammatical position of the expressions which denote 
them. These invariably treat entities mentioned in subject position as more 
salient than those in object position, which are in turn more salient than those 
mentioned in subsequent positions. 

Passages such as (18.50) and (18.51) lend support for such a hierar¬ 
chy. Although the first sentence in each case expresses roughly the same 
propositional content, the preferred referent for the pronoun him varies with 
the subject in each case - John in (18.50) and Bill in (18.51). In example 
(18.52), the references to John and Bill are conjoined within the subject po¬ 
sition. Since both seemingly have the same degree of salience, it is unclear 
to which the pronoun refers. 

(18.50) John went to the Acura dealership with Bill. He bought an Integra. 

[ he = John ] 

(18.51) Bill went to the Acura dealership with John. He bought an Integra. 

[ he = Bill ] 

(18.52) John and Bill went to the Acura dealership. He bought an Integra. 

[ he = ?? ]. 

Repeated Mention Some theories incorporate the idea that entities that 
have been focused on in the prior discourse are more likely to continue to 
be focused on in subsequent discourse, and hence references to them are 
more likely to be pronominalized. For instance, whereas the pronoun in 
example (18.51) has Bill as its preferred interpretation, the pronoun in the 
final sentence of example (18.53) is more likely to refer to John. 

(18.53) John needed a car to get to his new job. He decided that he wanted 
something sporty. Bill went to the Acura dealership with him. He 
bought an Integra. [ he = John ] 

Parallelism There are also strong preferences that appear to be induced by 
parallelism effects, as in example (18.54). 
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(18.54) Mary went with Sue to the Acura dealership. Sally went with her to 
the Mazda dealership. [ her = Sue ] 

The grammatical role hierarchy described above ranks Mary as more salient 
than Sue, and thus should be the preferred referent of her. Furthermore, there 
is no semantic reason that Mary cannot be the referent. Nonetheless, her is 
instead understood to refer to Sue. 

This suggests that we might want a heuristic which says that non¬ 
subject pronouns prefer non-subject referents. However, such a heuristic 
may not work for cases that lack the structural parallelism of example (18.54), 
such as example (18.55), in which Mary is the preferred referent of the pro¬ 
noun instead of Sue. 

(18.55) Mary went with Sue to the Acura dealership. Sally told her not to 
buy anything. [ her = Mary ] 

Verb Semantics Certain verbs appear to place a semantically-oriented em¬ 
phasis on one of their argument positions, which can have the effect of bi¬ 
asing the manner in which subsequent pronouns are interpreted. Compare 
sentences (18.56) and (18.57). 

(18.56) John telephoned Bill. He lost the pamphlet on Acuras. 

(18.57) John criticized Bill. He lost the pamphlet on Acuras. 

These examples differ only in the verb used in the first sentence, yet the 
subject pronoun in passage (18.56) is typically resolved to John, whereas 
the pronoun in passage (18.57) is resolved to Bill. Some researchers have 
claimed that this effect results from what has been called the ‘implicit causal¬ 
ity’ of a verb: the implicit cause of a ‘criticizing’ event is considered to be 
its object, whereas the implicit cause of a ‘telephoning’ event is considered 
to be its subject. This emphasis results in a higher degree of salience for the 
entity in this argument position, which leads to the different preferences for 
examples (18.56) and (18.57). 

Similar preferences have been articulated in terms of the thematic roles 
(see Chapter 16) that the potential antecedents occupy. For example, most 
hearers resolve He to John in example (18.58) and to Bill in example (18.59). 
Although these referents are evoked from different grammatical role po¬ 
sitions, they both fill the Goal thematic role of their corresponding verbs, 
whereas the other potential referent fills the Source. Likewise, hearers gen¬ 
erally resolve He to John and Bill in examples (18.60) and (18.61) respec¬ 
tively, providing evidence that fillers of the Stimulus role are preferred over 
fillers of the Experiencer role. 
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(18.58) John seized the Acura pamphlet from Bill. He loves reading about 
cars. (Goal=John, Source=Bill) 

(18.59) John passed the Acura pamphlet to Bill. He loves reading about 
cars. (Goal=Bill, Source=John) 

(18.60) The car dealer admired John. He knows Acuras inside and out. 
(Stimulus=John, Expcricncci-thc car dealer) 

(18.61) The car dealer impressed John. He knows Acuras inside and out. 
(Stimulus=the car dealer, Experiencer=John) 


SALIENCE 

VALUE 

SALIENCE 

FACTORS 


An Algorithm for Pronoun Resolution 

None of the algorithms for pronoun resolution that have been proposed to 
date successfully account for all of these preferences, let alone succeed in 
resolving the contradictions that will arise between them. However, Lappin 
and Leass (1994) describe a straightforward algorithm for pronoun inter¬ 
pretation that takes many of these into consideration. The algorithm em¬ 
ploys a simple weighting scheme that integrates the effects of the recency 
and syntactically-based preferences; no semantic preferences are employed 
beyond those enforced by agreement. We describe a slightly simplified por¬ 
tion of the algorithm that applies to non-reflexive, third person pronouns. 

Broadly speaking, there are two types of operations performed by the 
algorithm: discourse model update and pronoun resolution. First, when a 
noun phrase that evokes a new entity is encountered, a representation for it 
must be added to the discourse model and a degree of salience (which we 
call a salience value) computed for it. The salience value is calculated as 
the sum of the weights assigned by a set of salience factors. The salience 
factors used and their corresponding weights are shown in Figure 18.5. 


Sentence recency 

100 

Subject emphasis 

80 

Existential emphasis 

70 

Accusative (direct object) emphasis 

50 

Indirect object and oblique complement emphasis 

40 

Non-adverbial emphasis 

50 

Head noun emphasis 

80 

Figure 18.5 Salience factors in Lappin and Leass’s system. 


The weights that each factor assigns to an entity in the discourse model 
are cut in half each time a new sentence is processed. This, along with 
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the added effect of the sentence recency weight (which initially assigns a 
weight of 100, to be cut in half with each succeeding sentence), captures the 
Recency preference described on page 676, since referents mentioned in the 
current sentence will tend to have higher weights than those in the previous 
sentence, which will in turn be higher than those in the sentence before that, 
and so forth. 

Similarly, the next five factors in Figure 18.5 can be viewed as a way 
of encoding a grammatical role preference scheme using the following hier¬ 
archy: 

subject > existential predicate nominal > object > indirect ob¬ 
ject or oblique > demarcated adverbial PP 

These five positions are exemplified by the position of the italicized phrases 
in examples (18.62)-(18.66) respectively. 

(18.62) AnAcura Integra is parked in the lot. (subject) 

(18.63) There is an Acura Integra parked in the lot. (existential predicate 
nominal) 

(18.64) John parked an Acura Integra in the lot. (object) 

(18.65) John gave his Acura Integra a bath, (indirect object) 

(18.66) Inside his Acura Integra , John showed Susan his new CD player, 
(demarcated adverbial PP) 

The preference against referents in demarcated adverbial PPs (i.e., those sep¬ 
arated by punctuation, as with the comma in example (18.66)) is encoded as 
a positive weight of 50 for every other position, listed as the non-adverbial 
emphasis weight in Figure 18.5. This ensures that the weight for any ref¬ 
erent is always positive, which is necessary so that the effect of halving the 
weights is always to reduce them. 

The head noun emphasis factor penalizes referents which are embed¬ 
ded in larger noun phrases, again by promoting the weights of referents that 
are not. Thus, the Acura Integra in each of examples (18.62)-(18.66) will 
receive 80 points for being denoted by a head noun, whereas the Acura Inte¬ 
gra in example (18.67) will not, since it is embedded within the subject noun 
phrase. 

(18.67) The owner’s manual for an Acura Integra is on John’s desk. 

Each of these factors contributes to the salience of a referent based on 
the properties of the noun phrase that denotes it. Of course, it could be that 
several noun phrases in the preceding discourse refer to the same referent, 
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each being assigned a different level of salience, and thus we need a way 
in which to combine the contributions of each. To address this, Lappin and 
Leass associate with each referent an equivalence class that contains all of 
the noun phrases that have been determined to refer to it. The weight that 
a salience factor assigns to a referent is the highest of the weights it assigns 
to the members of its equivalence class. The salience weight for a refer¬ 
ent is then calculated by summing these weights for each factor. The scope 
of a salience factor is a sentence, so, for instance, if a potential referent is 
mentioned in the current sentence as well as the previous one, the sentence 
recency weight will be factored in for each. (On the other hand, if the same 
referent is mentioned more than once in the same sentence, this weight will 
be counted only once.) Thus, multiple mentions of a referent in the prior dis¬ 
course can potentially increase its salience, which has the effect of encoding 
the preference for repeated mentions discussed on page 676. 

Once we have updated the discourse model with new potential refer¬ 
ents and recalculated the salience values associated with them, we are ready 
to consider the process of resolving any pronouns that exist within a new 
sentence. In doing this, we factor in two more salience weights, one for 
grammatical role parallelism between the pronoun and the potential refer¬ 
ent, and one to disprefer cataphoric reference. The weights are shown in 
Figure 18.6. Unlike the other preferences, these two cannot be calculated 
independently of the pronoun, and thus cannot be calculated during the dis¬ 
course model update step. We will use the term initial salience value for the 
weight of a given referent before these factors are applied, and the term final 
salience value for after they have applied. 


Role Parallelism 

35 

Cataphora 

-175 

Figure 18.6 Per pronoun salience weights in Lappin and Leass’s system. 


We are now ready to specify the pronoun resolution algorithm. Assum¬ 
ing that the discourse model has been updated to reflect the initial salience 
values of referents as described above, the steps taken to resolve a pronoun 
are as follows: 

1. Collect the potential referents (up to four sentences back). 

2. Remove potential referents that do not agree in number or gender with 
the pronoun. 
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3. Remove potential referents that do not pass intrasentential syntactic 
coreference constraints (as described on page 673). 

4. Compute the total salience value of the referent by adding any appli¬ 
cable values from Figure 18.6 to the existing salience value previously 
computed during the discourse model update step (i.e., the sum of the 
applicable values in Figure 18.5). 

5. Select the referent with the highest salience value. In the case of ties, 
select the closest referent in terms of string position (computed without 
bias to direction). 

We illustrate the operation of the algorithm by stepping through exam¬ 
ple (18.68). 

(18.68) John saw a beautiful Acura Integra at the dealership. He showed it 
to Bob. He bought it. 

We first process the first sentence to collect potential referents and 
compute their initial salience values. The following table shows the con¬ 
tribution to salience from each of the salience factors. 



Rec 

Subj 

Exist 

Obj 

bid-Obj 

Non-Adv 

HeadN 

Total 

John 

100 

80 




50 

80 

310 

Integra 

100 



50 


50 

80 

280 

dealership 

100 





50 

80 

230 


There are no pronouns to be resolved in this sentence, so we move 
on to the next, degrading the above values by a factor of two as shown in 
the following table. The phrases column shows the equivalence class of 
referring expressions for each referent. 


Referent 

Phrases 

Value 

John 

{ John } 

155 

Integra 

{ a beautiful Acura Integra } 

140 

dealership 

{ the dealership } 

115 


The first noun phrase in the second sentence is the pronoun he. Because 
he specifies male gender, Step 2 of the resolution algorithm reduces the set 
of possible referents to include only John, so we can stop there and take this 
to be the referent. 
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The discourse model must now be updated. First, the pronoun he is 
added in the equivalence class for John. Since he occurs in the current sen¬ 
tence and John in the previous one, the salience factors do not overlap be¬ 
tween the two. The pronoun is in the current sentence (recency=100), subject 
position (=80), not in an adverbial (=50), and not embedded (=80), and so a 
total of 310 is added to the current weight for John: 


Referent 

Phrases 

Value 

John 

{ John , he\ } 

465 

Integra 

{ a beautiful Acura Integra } 

140 

dealership 

{ the dealership } 

115 


The next noun phrase in the second sentence is the pronoun it , which is 
compatible with the Integra or the dealership. We first need to compute the 
final salience values by adding the applicable weights from Figure 18.6 to 
the initial salience values above. Neither referent assignment would result in 
cataphora, so that factor does not apply. For the parallelism preference, both 
it and a beautiful Acura Integra are in object position within their respective 
sentences (whereas the dealership is not), so a weight of 35 is added to this 
option. With the Integra having a weight of 175 and the dealership a weight 
of 115, the Integra is taken to be the referent. 

Again, the discourse model must now be updated. Since it is in a 
nonembedded object position, it receives a weight of 100+50+50+80=280, 
and is added to the current weight for the Integra. 


Referent 

Phrases 

Value 

John 

{ John , he\ } 

465 

Integra 

{ a beautiful Acura Integra , it\ } 

420 

dealership 

{ the dealership } 

115 


The final noun phrase in the second sentence is Bob , which introduces 
a new discourse referent. Since it occupies an oblique argument position, it 
receives a weight of 100+40+50+80=270. 


Referent 

Phrases 

Value 

John 

{ John , he\ } 

465 

Integra 

{ a beautiful Acura Integra , it\ } 

420 

Bob 

{ Bob } 

270 

dealership 

{ the dealership } 

115 


Now we are ready to move on to the final sentence. We again degrade 
the current weights by one half. 
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Referent 

Phrases 

Value 

John 

{ John , he\ } 

232.5 

Integra 

{ a beautiful Acura Integra , it\ } 

210 

Bob 

{ Bob } 

135 

dealership 

{ the dealership } 

57.5 


The reader can confirm that the referent of he will be resolved to John, 
and the referent of it to the Integra. 

The weights used by Lappin and Leass were arrived at by experimenta¬ 
tion on a development corpus of computer training manuals. This algorithm, 
when combined with several filters not described here, achieved 86% accu¬ 
racy when applied to unseen test data within the same genre. It is possible 
that these exact weights may not be optimal for other genres (and even more 
so for other languages), so the reader may want to experiment with these on 
training data for a new application or language. 

In Exercise 18.7, we consider a version of the algorithm that relies 
only on a noun phrase identifier (see also Kennedy and Boguraev (1996)). In 
the next paragraphs, we briefly summarize two other approaches to pronoun 
resolution. 

A Tree Search Algorithm Hobbs (1978b) describes an algorithm for pro¬ 
noun resolution which takes the syntactic representations of the sentences up 
to and including the current sentence as input, and performs a search for an 
antecedent noun phrase on these trees. There is no explicit representation of 
a discourse model or preferences as in the Lappin and Leass algorithm. How¬ 
ever, certain of these preferences are approximated by the order in which the 
search on syntactic trees is performed. 

An algorithm that searches parse trees must also specify a grammar, 
since the assumptions regarding the structure of syntactic trees will affect 
the results. A fragment for English that the algorithm uses is given in Fig¬ 
ure 18.7. The steps of the algorithm are as follows. 

1. Begin at the noun phrase (NP) node immediately dominating the pro¬ 
noun. 

2. Go up the tree to the first NP or sentence (S) node encountered. Call 
this node X, and call the path used to reach it p. 

3. Traverse all branches below node X to the left of path p in a left-to- 
right, breadth-first fashion. Propose as the antecedent any NP node 
that is encountered which has an NP or S node between it and X. 

4. If node X is the highest S node in the sentence, traverse the surface 
parse trees of previous sentences in the text in order of recency, the 
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S -A NPVP 


NP -A 


(Det) Nominal 


y pronoun 

^ f determiner ) 

Det - { NP i ) 

PP —> preposition NP 
Nominal -A noun (PP)* 
Rel —> wh-wordS 
VP -A verb NP (PP)* 



Figure 18.7 A grammar fragment for the Tree Search algorithm. 


most recent first; each tree is traversed in a left-to-right, breadth-first 
manner, and when an NP node is encountered, it is proposed as an¬ 
tecedent. If X is not the highest S node in the sentence, continue to 
step 5. 

5. From node X, go up the tree to the first NP or S node encountered. Call 
this new node X, and call the path traversed to reach it p. 

6. If X is an NP node and if the path p to X did not pass through the Nom¬ 
inal node that X immediately dominates, propose X as the antecedent. 

7. Traverse all branches below node X to the left of path p in a left-to- 
right, breadth-first manner. Propose any NP node encountered as the 
antecedent. 

8. If X is an S node, traverse all branches of node X to the right of path 
p in a left-to-right, breadth-first manner, but do not go below any NP 
or S node encountered. Propose any NP node encountered as the an¬ 
tecedent. 

9. Go to Step 4. 

Demonstrating that this algorithm yields the correct coreference assignments 
for example (18.68) is left as Exercise 18.3. 

As stated, the algorithm depends on complete and correct syntactic 
structures as input. Hobbs evaluated his approach manually (with respect 
to both parse construction and algorithm implementation) on one hundred 
examples from each of three different texts, reporting an accuracy of 88.3%. 
(The accuracy increases to 91.7% if certain selectional restriction constraints 
are assumed.) Lappin and Leass encoded a version of this algorithm within 
their system, and reported an accuracy of 82% on their test corpus. Although 
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this is less than the 86% accuracy achieved by their own algorithm, it should 
be borne in mind that the test data Lappin and Leass used was from the same 
genre as their development set, but different than the genres that Hobbs used 
in developing his algorithm. 

A Centering Algorithm As we described above, the Hobbs algorithm 
does not use an explicit representation of a discourse model. The Lappin 
and Leass algorithm does, but encodes salience as a weighted combination 
of preferences. Centering theory (Grosz et al, 1995, henceforth GJW), also 
has an explicit representation of a discourse model, and incorporates an ad¬ 
ditional claim: that there is a single entity being ‘centered’ on at any given 
point in the discourse which is to be distinguished from all other entities that 
have been evoked. 

There are two main representations tracked in the discourse model. In 
what follows, take U n and U n +\ to be two adjacent utterances. The backward 
looking center of U n , denoted as Q,([/„), represents the entity currently be¬ 
ing focused on in the discourse after U n is interpreted. The forward looking 
centers of U n , denoted as Cf(U n ), form an ordered list containing the entities 
mentioned in U n , all of which could serve as the C) 7 of the following utter¬ 
ance. In fact, Cb(U n + 1 ) is by definition the most highly ranked element of 
Cf(U n ) mentioned in U n +\. (The Q of the first utterance in a discourse is 
undefined.) As for how the entities in the Cf(U n ) are ordered, for simplic¬ 
ity’s sake we can use the grammatical role hierarchy encoded by (a subset 
of) the weights in the Lappin and Leass algorithm, repeated below. 1 

subject > existential predicate nominal > object > indirect ob¬ 
ject or oblique > demarcated adverbial PP 

Unlike the Lappin and Leass algorithm, however, there are no numerical 
weights attached to the entities on the list, they are simply ordered relative to 
each other. As a shorthand, we will call the highest-ranked forward-looking 
center C p (for ‘preferred center’). 

We describe a centering-based algorithm for pronoun interpretation 
due to Brennan et al (1987, henceforth BFP). (See also Walker et al (1994); 
for alternatives, see Kameyama (1986) and Strube and Hahn (1996), inter 
alia.) In this algorithm, preferred referents of pronouns are computed from 
relations that hold between the forward and backward looking centers in 
adjacent sentences. Four intersentential relationships between a pair of ut¬ 
terances U n and U n +\ are defined depending on the relationship between 

L This is an extended form of the hierarchy used in Brennan et al (1987), described below. 
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64(64+ 1 ), 0,(14), and C p (U n + 1 ); these are shown in Figure 18.8. 



C b (U n+l ) = C b (U n ) 
or undefined C b ( U n ) 

C b (U n+ i)?C b (U n ) 

64(I4+i) — C p (U n+ i) 

Continue 

Smooth-Shift 

C b {U n+l )^C p {U n+l ) 

Retain 

Rough-Shift 

Figure 18.8 Transitions in the BFP algorithm. 


The following rules are used by the algorithm. 

• Rule 1: If any element of Cf(U n ) is realized by a pronoun in utterance 
I4+i, then (4(I4+i) must be realized as a pronoun also. 

• Rule 2: Transition states are ordered. Continue is preferred to Retain 
is preferred to Smooth-Shift is preferred to Rough-Shift. 

Having defined these concepts and rules, the algorithm is defined as 
follows. 

1. Generate possible Q,-C/ combinations for each possible set of refer¬ 
ence assignments 

2. Filter by constraints, e.g., syntactic coreference constraints, selectional 
restrictions, centering rules and constraints 

3. Rank by transition orderings 

The pronominal referents that get assigned are those which yield the most 
preferred relation in Rule 2, assuming that Rule 1 and other coreference 
constraints (gender, number, syntactic, selectional restrictions) are not vio¬ 
lated. 

Let us step through passage (18.68), repeated below as (18.69), to il¬ 
lustrate the algorithm. 

(18.69) John saw a beautiful Acura Integra at the dealership. (IJ \) 

He showed it to Bob. (14) 

He bought it. (14) 

Using the grammatical role hierarchy to order the C/, for sentence U\ we 
get: 

Cf(Ui): {John, Integra, dealership} 

C p (Ui): John 
Cb{U\): undefined 
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Sentence U 2 contains two pronouns: he, which is compatible with John, and 
it, which is compatible with the Acura or the dealership. John is by definition 
C b (U 2 ), because he is the highest ranked member of Cf{U\) mentioned in 
U 2 (again, he is the only possible referent for he). We compare the resulting 
transitions for each possible referent of it. If we assume Prefers to the Acura, 
the assignments would be: 

Cf(U 2 ): {John, Integra, Bob} 

C P (U 2 ): John 
C b (U 2 ): John 

Result: Continue (C p (U 2 )=C b (U 2 ); C b (U\) undefined) 

If we assume it refers to the dealership, the assignments would be: 

Cf(U 2 ): {John, dealership, Bob} 

C P (U 2 ): John 
C b (U 2 ): John 

Result: Continue (C p (U 2 )=Cb(U 2 ); Cb{U\) undefined) 

Since both possibilities result in a Continue transition, the algorithm does 
not say which to accept. For the sake of illustration, we will assume that ties 
are broken in terms of the ordering on the previous Cf list. Thus, we will 
take it to refer to the Integra instead of the dealership, leaving the current 
discourse model as represented in the first possibility above. 

In sentence C/ 3 , he is compatible with either John or Bob, whereas it 
is compatible with the Integra. If we assume he refers to John, then John is 
Q>(CA) and the assignments would be: 

t/(C/ 3 ): {John, Acura} 

C P (U 3 ): John 
C b (U 3 ): John 

Result: Continue (C p (C/ 3 ) =C b (C/ 3 ) -C b (C/ 2 )) 

If we assume he refers to Bob, then Bob is C b {U^) and the assignments would 
be: 

C/(C/ 3 ): {Bob, Acura} 

C P (U 3 ): Bob 
C b (U 3 ): Bob 

Result: Smooth-Shift (C p (U 3 )=C b (U 3 ); C b (U 3 )^C b (U 2 )) 

Since a Continue is preferred to a Smooth-Shift per Rule 2, John is correctly 
taken to be the referent. 
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The main salience factors that the centering algorithm implicitly incor¬ 
porates include the grammatical role, recency, and repeated mention pref¬ 
erences. Unlike the Lappin and Leass algorithm, however, the manner in 
which the grammatical role hierarchy affects salience is indirect, since it is 
the resulting transition type that determines the final reference assignments. 
In particular, a referent in a low-ranked grammatical role will be preferred to 
one in a more highly ranked role if the former leads to a more highly ranked 
transition. Thus, the centering algorithm may (often, but not always, incor¬ 
rectly) resolve a pronoun to a referent that other algorithms would consider 
to be of relatively low salience (Lappin and Leass, 1994; Kehler, 1997a). For 
instance, in example (18.70), 

(18.70) Bob opened up a new dealership last week. John took a look at the 

Acuras in his lot. He ended up buying one. 

the centering algorithm will assign Bob as the referent of the subject pronoun 
he in the third sentence - since Bob is this assignment results in a 

Continue relation whereas assigning John results in a Smooth-Shift relation. 
On the other hand, the Hobbs and Lappin/Leass algorithms will assign John 
as the referent. 

Like the Hobbs algorithm, the centering algorithm was developed on 
the assumption that correct syntactic structures are available as input. In 
order to perform an automatic evaluation on naturally occurring data, the 
centering algorithm would have to be specified in greater detail, both in terms 
of how all noun phrases in a sentence are ordered with respect to each other 
on the Cf list (the current approach only includes nonembedded fillers of 
certain grammatical roles, generating only a partial ordering), as well as how 
all pronouns in a sentence can be resolved (e.g., recall the indeterminacy in 
resolving it in the second sentence of example (18.68)). 

Walker (1989), however, performed a manual evaluation of the center¬ 
ing algorithm on a corpus of 281 examples distributed over texts from three 
genres, and compared its performance to the Hobbs algorithm. The evalua¬ 
tion assumed adequate syntactic representations, grammatical role labeling, 
and selectional restriction information as input. Furthermore, in cases in 
which the centering algorithm did not uniquely specify a referent, only those 
cases in which the Hobbs algorithm identified the correct one were counted 
as errors. With this proviso, Walker reports an accuracy of 77.6% for cen¬ 
tering and 81.8% for Hobbs. See also Tetreault (1999) for a comparison 
between several centering-based algorithms and the Hobbs algorithm. 
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18.2 Text Coherence 

Much of the previous section focussed on the nature of anaphoric reference 
and methods for resolving pronouns in discourse. Anaphoric expressions 
have often been called cohesive devices (Halliday and Hasan, 1976), since devices E 
the coreference relations they establish serve to ‘tie’ different parts of a dis¬ 
course together, thus making it cohesive. While discourses often contain co¬ 
hesive devices, the existence of such devices alone does not satisfy a stronger 
requirement that a discourse must meet, that of being coherent. In this sec¬ 
tion, we describe what it means for a text to be coherent, and computational 
mechanisms for determining coherence. 

The Phenomenon 

Assume that you have collected an arbitrary set of well-formed and inde¬ 
pendently interpretable utterances, for instance, by randomly selecting one 
sentence from each of the previous chapters of this book. Do you have a 
discourse? Almost certainly not. The reason is that these utterances, when 
juxtaposed, will not exhibit coherence. Consider, for example, the differ- coherence 
ence between passages (18.71) and (18.72). 

(18.71) John hid Bill’s car keys. He was drunk. 

(18.72) ?? John hid Bill’s car keys. He likes spinach. 

While most people find passage (18.71) to be rather unremarkable, they 
find passage (18.72) to be odd. Why is this so? Like passage (18.71), 
the sentences that make up passage (18.72) are well formed and readily 
interpretable. Something instead seems to be wrong with the fact that the 
sentences are juxtaposed. The hearer might ask, for instance, what hiding 
someone’s car keys has to do with liking spinach. By asking this, the hearer 
is questioning the coherence of the passage. 

Alternatively, the hearer might try to construct an explanation that 
makes it coherent, for instance, by conjecturing that perhaps someone of¬ 
fered John spinach in exchange for hiding Bill’s car keys. In fact, if we con¬ 
sider a context in which we had known this already, the passage now sounds 
a lot better! Why is this? This conjecture allows the hearer to identify John’s 
liking spinach as the cause of his hiding Bill’s car keys, which would explain 
how the two sentences are connected. The very fact that hearers try to iden¬ 
tify such connections is indicative of the need to establish coherence as part 
of discourse comprehension. 
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The possible connections between utterances in a discourse can be 
relations e specified as a set of coherence relations. A few such relations, proposed 
by Hobbs (1979a), are given below. The terms So and S\ represent the mean¬ 
ings of the two sentences being related. 

Result: Infer that the state or event asserted by So causes or could cause the 
state or event asserted by Si. 

(18.73) John bought an Acura. His father went ballistic. 

Explanation: Infer that the state or event asserted by Si causes or could 
cause the state or event asserted by So. 

(18.74) John hid Bill’s car keys. He was drunk. 

Parallel: Infer /?(ai,a 2 ,...) from the assertion of So and p(b\.b 2 -...) from 
the assertion of Si, where at and bj are similar, for all i. 

(18.75) John bought an Acura. Bill leased a BMW. 

Elaboration: Infer the same proposition P from the assertions of So and Si. 

(18.76) John bought an Acura this weekend. He purchased a beautiful new 
Integra for 20 thousand dollars at Bill’s dealership on Saturday 
afternoon. 

Occasion: A change of state can be inferred from the assertion of So, whose 
final state can be inferred from Si, or a change of state can be inferred from 
the assertion of Si, whose initial state can be inferred from So. 

(18.77) John bought an Acura. He drove to the ballgame. 

A mechanism for identifying coherence could support a number of nat¬ 
ural language applications, including information extraction and summariza¬ 
tion. For example, discourses that are coherent by virtue of the Elaboration 
relation are often characterized by a summary sentence followed by one or 
more sentences adding detail to it, as in passage (18.76). Although there 
are two sentences describing events in this passage, the fact that we infer an 
Elaboration relation tells us that the same event is being described in each. 
A mechanism for identifying this fact could tell an information extraction 
or summarization system to merge the information from the sentences and 
produce a single event description instead of two. 
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An Inference Based Resolution Algorithm 

Each coherence relation described above is associated with one or more con¬ 
straints that must be met for it to hold. How can we apply these constraints? 
To do this, we need a method for performing inference. Perhaps the most 
familiar type of inference is deduction; recall from Section 14.3 that the 
central rule of deduction is modus ponens: 

(3 

a 

— r - 

An example of modus ponens is the following: 

All Acuras are fast. 

John’s car is an Acura. 

John’s car is fast. 

Deduction is a form of sound inference: if the premises are true, then the 
conclusion must be true. 

However, much of language understanding is based on inferences that 
are not sound. While the ability to draw unsound inferences allows for a 
greater range of inferences to be made, it can also lead to false interpretations 
and misunderstandings. A method for such inference is logical abduction 
(Pierce, 1955). The central rule of abductive inference is: 

P 

P 

a 

Whereas deduction runs an implication relation forward, abduction runs it 
backward, reasoning from an effect to a potential cause. An example of 
abduction is the following: 

All Acuras are fast. 

John’s car is fast. 

John’s car is an Acura. 


DEDUCTION 


SOUND 

INFERENCE 


ABDUCTION 


Obviously, this may be an incorrect inference: John’s car may be made by 
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another manufacturer yet still be fast. 

In general, a given effect (3 may have many potential causes a;. We 
generally will not want to merely reason from a fact to a possible explana¬ 
tion of it, we want to identify the best explanation of it. To do this, we need a 
method for comparing the quality of alternative abductive proofs. There are a 
variety of strategies one could employ for doing this. One possibility is to use 
a probabilistic model (Chamiak and Goldman, 1988; Chamiak and Shimony, 
1990), although issues arise in choosing the appropriate space over which to 
calculate these probabilities, and in finding a way to acquire them given the 
lack of a corpus of events. Another method is to use a purely heuristic strat¬ 
egy (Chamiak and McDermott, 1985, Chapter 10) indexChamiak, E., such 
as preferring the explanation with the smallest number of assumptions, or 
choosing the explanation that uses the most specific characteristics of the in¬ 
put. While such heuristics may be easy to implement, they generally prove 
to be too brittle and limiting. Finally, a more general cost-based strategy can 
be used which combines features (both positive and negative) of the proba¬ 
bilistic and heuristic approaches. The approach to abductive interpretation 
we illustrate here, due to Hobbs et al (1993), uses such a strategy. To sim¬ 
plify the discussion, however, we will largely ignore the cost component of 
the system, keeping in mind that one is nonetheless necessary. 

Hobbs et al (1993) apply their method to a broad range of problems 
in language interpretation; here we focus on its use in establishing discourse 
coherence, in which world and domain knowledge are used to determine 
the most plausible coherence relation holding between utterances. Let us 
step through the analysis that leads to establishing the coherence of pas¬ 
sage (18.71). First, we need axioms about coherence relations themselves. 
Axiom (18.78) states that a possible coherence relation is the Explanation 
relation; other relations would have analogous axioms. 

(18.78) (' iei,ej)Explanation{ei,ej ) D CoherenceRel(ei,ej) 

The variables ei and ej represent the events (or states) denoted by the two 
utterances being related, and the D symbol is used to denote the implica¬ 
tion relation. In this axiom and those given below, quantifiers always scope 
over everything to their right. This axiom tells us that, given that we need 
to establish a coherence relation between two events, one possibility is to 
abductively assume that the relation is Explanation. 

The Explanation relation requires that the second utterance express the 
cause of the effect that the first sentence expresses. We can state this as 
axiom (18.79). 
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(18.79) (Vei,ej)cause(ej,ei) D Explanation^,ej) 

In addition to axioms about coherence relations, we also need axioms 
representing general knowledge about the world. The first axiom we use says 
that if someone is drunk, then others will not want that person to drive, and 
that the former causes the latter (for convenience, the state of not wanting is 
denoted by the diswant predicate). 

(18.80) (\/x : y,ei)drunk(ei : x) D 

(3ej,ek)diswant(ej,y,ek) f\drive{e^x) l\cause{e^ej) 

Before we move on, a few notes are in order concerning this axiom and 
the others we will present. First, axiom (18.80) is stated using universal 
quantifiers to bind several of the variables, which essentially says that in 
all cases in which someone is drunk, all people do not want that person 
to drive. Although we might hope that this is generally the case, such a 
statement is nonetheless too strong. The way in which this is handled in 
the Hobbs et al. system is by including an additional relation, called an etc 
predicate, in the antecedent of such axioms. An etc predicate represents all 
the other properties that must be true for the axiom to apply, but which are 
too vague to state explicitly. These predicates therefore cannot be proven, 
they can only be assumed at a corresponding cost. Because rules with high 
assumption costs will be dispreferred to ones with low costs, the likelihood 
that the rule applies can be encoded in terms of this cost. Since we have 
chosen to simplify our discussion by ignoring costs, we will similarly ignore 
the use of etc predicates. 

Second, each predicate has what may look like an ‘extra’ variable in 
the first argument position; for instance, the drive predicate has two argu¬ 
ments instead of one. This variable is used to reify the relationship denoted 
by the predicate so that it can be referred to from argument places in other 
predicates. For instance, reifying the drive predicate with the variable et al¬ 
lows us to express the idea of not wanting someone to drive by referring to 
it in the final argument of the diswant predicate. 

Picking up where we left off, the second world knowledge axiom we 
use says that if someone does not want someone else to drive, then they do 
not want this person to have his car keys, since car keys enable someone to 
drive. 

(18.81) (\/x,y,ej,ek)diswant(ej,y,ek) Adrive(ek-,x) D 

(3z, ei,e m )diswant(ei,y,e m ) Ahave(e m ^x^z) Acarkeys(z,x) A 
cause(ej^ei) 
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The third axiom says that if someone doesn’t want someone else to have 
something, he might hide it from him. 

(18.82) iyx,y,z,ei,ej)diswant{ei,y,e m ) !\have{e m ,x,z) D 

(3e n )hide(e n ,y,x,z) A cause{e \, e n ) 

The final axiom says simply that causality is transitive, that is, if et causes ej 
and ej causes e*, then causes e*. 

(18.83) (Vei,ej,ek)cause(ei,ej) Acause(ej,ek) D cause{e^ek) 

Finally, we have the content of the utterances themselves, that is, that 
John hid Bill’s car keys (from Bill), 

(18.84) hide(e \, John, bill , ck) A carkeys(ck , bill ) 

and that someone described using the pronoun ‘he’ was drunk; we will rep¬ 
resent the pronoun with the free variable he. 

(18.85) drunk{e 2 ,he ) 

We can now see how reasoning with the content of the utterances along 
with the aforementioned axioms allows the coherence of passage (18.71) to 
be established under the Explanation relation. The derivation is summarized 
in Figure 18.9; the sentence interpretations are shown in boxes. We start by 
assuming there is a coherence relation, and using axiom (18.78) hypothesize 
that this relation is Explanation, 

(18.86) Explanation^ 1 ,^ 2 ) 

which, by axiom (18.79), means we hypothesize that 

(18.87) cause(e 2 ,^ 1 ) 

holds. By axiom (18.83), we can hypothesize that there is an intermediate 
cause 

(18.88) cause{e 2 ^e^) Acause{e^e\) 

and we can repeat this again by expanding the first conjunct of (18.88) to 
have an intermediate cause e^. 

(18.89) cause(e2,e4) Acause(e4,ei) 

We can take the hide predicate from the interpretation of the first sentence in 
(18.84) and the second cause predicate in (18.88), and, using axiom (18.82), 
hypothesize that John did not want Bill to have his car keys: 

(18.90) diswant(e?, Johnses) A have(e$^bill,ck) 
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From this, the carkeys predicate from (18.84), and the second cause predi¬ 
cate from (18.89), we can use axiom (18.81) to hypothesize that John does 
not want Bill to drive: 

(18.91) diswant(e 4 Johnses) A drived, bill) 

From this, axiom (18.80), and the second cause predicate from (18.89), we 
can hypothesize that Bill was drunk: 

(18.92) drunk(e 2 )bill) 

But now we find that we can ‘prove’ this fact from the interpretation of the 
second sentence if we simply assume that the free variable he is bound to 
Bill. Thus, the establishment of coherence has gone through, as we have 
identified a chain of reasoning between the sentence interpretations - one 
that includes unprovable assumptions about axiom choice and pronoun as¬ 
signment - that results in cause(e 2 ,e\), as required for establishing the Ex¬ 
planation relationship. 



This derivation illustrates a powerful property of coherence establish¬ 
ment, namely its ability to cause the hearer to infer information about the 
situation described by the discourse that the speaker has left unsaid. In this 
case, the derivation required the assumption that John hid Bill’s keys be¬ 
cause he did not want him to drive (presumably out of fear of him having 
an accident, or getting stopped by the police), as opposed to some other ex¬ 
planation, such as playing a practical joke on him. This cause is not stated 
anywhere in passage (18.71); it arises only from the inference process trig¬ 
gered by the need to establish coherence. In this sense, the meaning of a 
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discourse is greater than the sum of the meanings of its parts. That is, a dis¬ 
course typically communicates far more information than is contained in the 
interpretations of the individual sentences that comprise it. 

We now return to passage (18.72), repeated below as (18.94), which 
was notable in that it lacks the coherence displayed by passage (18.71), re¬ 
peated below as (18.93). 

(18.93) John hid Bill’s car keys. He was drunk. 

(18.94) ?? John hid Bill’s car keys. He likes spinach. 

We can now see why this is: there is no analogous chain of inference capable 
of linking the two utterance representations, in particular, there is no causal 
axiom analogous to (18.80) that says that liking spinach might cause some¬ 
one to not want you to drive. Without additional information that can sup¬ 
port such a chain of inference (such as the aforementioned scenario in which 
someone promised John spinach in exchange for hiding Bill’s car keys), the 
coherence of the passage cannot be established. 

Because abduction is a form of unsound inference, it must be possible 
to subsequently retract the assumptions made during abductive reasoning, 
defeasible that is, abductive inferences are defeasible. For instance, if passage (18.93) 
was followed by sentence (18.95), 

(18.95) Bill’s car isn’t here anyway; John was just playing a practical joke 

on him. 

the system would have to retract the original chain of inference connecting 
the two clauses in (18.93), and replace it with one utilizing the fact that the 
hiding event was part of a practical joke. 

In a more general knowledge base designed to support a broad range 
of inferences, we would probably want axioms that are more general that 
those we used to establish the coherence of passage (18.93). For instance, 
consider axiom (18.81), which says that if you do not want someone to drive, 
then you do not want them to have their car keys. A more general form of the 
axiom would say that if you do not want someone to perform an action, and 
an object enables them to perform that action, then you do not want them 
to have the object. The fact that car keys enable someone to drive would 
then be encoded separately, along with many other similar facts. Likewise, 
axiom (18.80) says that if someone is drunk, you don’t want them to drive. 
We might replace this with an axiom that says that if someone does not want 
something to happen, then they don’t want something that will likely cause 
it to happen. Again, the facts that people typically don’t want other people 
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to get into car accidents, and that drunk driving causes accidents, would be 
encoded separately. 

While it is important to have computational models that shed light on 
the coherence establishment problem, large barriers remain for employing 
this and similar methods on a wide-coverage basis. In particular, the large 
number of axioms that would be required to encode all of the necessary 
facts about the world, and the lack of a robust mechanism for constraining 
inference with such a large set of axioms, makes these methods largely im¬ 
practical in practice. Such problems have come to be informally known as 
Al-complete, a play on the term NP-complete in computer science. An AI- ai-complete 
complete problem is one that essentially requires all of the knowledge - and 
abilities to utilize it - that humans have. 

Other approaches to analyzing the coherence structure of a discourse 
have also been proposed. One that has received broad usage is Rhetorical 
Structure Theory (RST) (Mann and Thompson, 1987a), which proposes a 
set of 23 rhetorical relations that can hold between spans of text within a 
discourse. While RST is oriented more toward text description than inter¬ 
pretation, it has proven to be a useful tool for developing natural language 
generation systems. RST is described in more detail in Section 20.4. 

Coherence and Coreference The reader may have noticed another inter¬ 
esting property of the proof that passage (18.71) is coherent. While the 
pronoun he was initially represented as a free variable, it got bound to Bill 
during the derivation. In essence, a separate procedure for resolving the 
pronoun was not necessary; it happened as a side effect of the coherence es¬ 
tablishment procedure. In addition to the tree-search algorithm presented on 
page 683, Hobbs (1978b) proposes this use of the coherence establishment 
mechanism as a second approach to pronoun interpretation. 

This approach provides an explanation for why the pronoun in passage 
(18.71) is most naturally interpreted as referring to Bill, but the pronoun in 
passage (18.96) is most naturally interpreted as referring to John. 

(18.96) John lost Bill’s car keys. He was drunk. 

Establishing the coherence of passage (18.96) under Explanation requires an 
axiom that says that being drunk could cause someone to lose something. 

Because such an axiom will dictate that the person who is drunk must be 
the same as the person losing something, the free variable representing the 
pronoun will become bound to John. The only lexico-syntactic difference 
between passages (18.96) and (18.71), however, is the verb of the first sen¬ 
tence. The grammatical positions of the pronoun and potential antecedent 
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noun phrases are the same in both cases, so syntactically-based preferences 
do not distinguish between these. 

Discourse Connectives Sometimes a speaker will include a specific cue, 
called a connective , that serves to constrain the set of coherence relations 
that can hold between two or more utterances. For example, the connec¬ 
tive because indicates the Explanation relationship explicitly, as in passage 

(18.97) . 

(18.97) John hid Bill’s car keys because he was drunk. 

The meaning of because can be represented as cause(e 2 ,^ 1 ), which would 
play a similar role in the proof as the cause predicate that was introduced 
abductively via axiom (18.79). 

However, connectives do not always constrain the possibilities to a sin¬ 
gle coherence relation. The meaning of and , for instance, is compatible with 
the Parallel, Occasion, and Result relations introduced on page 690, as ex¬ 
emplified in (18.98)-(18.100) respectively. 

(18.98) John bought an Acura and Bill leased a BMW. 

(18.99) John bought an Acura and drove to the ballgame. 

(18.100) John bought an Acura and his father went ballistic. 

However, and is not compatible with the Explanation relation; unlike pas¬ 
sage (18.97), passage (18.101) cannot mean the same thing as (18.71). 

(18.101) John hid Bill’s car keys and he was drunk. 

While the coherence resolution procedure can use connectives to con¬ 
strain the range of coherence relations that can be inferred between a pair of 
utterances, they in and of themselves do not create coherence. Any coher¬ 
ence relation indicated by a connective must still be established. Therefore, 
adding because to example (18.72), for instance, still does not make it co¬ 
herent. 

(18.102) ?? John hid Bill’s car keys because he likes spinach. 

Coherence establishment fails here for the same reason it does for example 
(18.72), that is, the lack of causal knowledge explaining how liking spinach 
would cause one to hide someone’s car keys. 
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18.3 Discourse Structure 

In the previous section, we saw how the coherence of a pair of sentences can 
be established. We now ask how coherence can be established for longer dis¬ 
courses. Does one simply establish coherence relations between all adjacent 
pairs of sentences? 

It turns out that the answer is no. Just as sentences have hierarchical 
structure (that is, syntax), so do discourses. Consider passage (18.103). 

(18.103) • John went to the bank to deposit his paycheck. (SI) 

• He then took a train to Bill’s car dealership. (S2) 

• He needed to buy a car. (S3) 

• The company he works for now isn’t near any public 
transportation. (S4) 

• He also wanted to talk to Bill about their softball league. (S5) 

Intuitively, the structure of passage (18.103) is not linear. The discourse 
seems to be primarily about the sequence of events described in sentences 
SI and S2, whereas sentences S3 and S5 are related most directly to S2, and 
S4 is related most directly to S3. The coherence relationships between these 
sentences result in the discourse structure shown in Figure 18.10. 



Each node in the tree represents a group of locally coherent utterances, 
called a discourse segment. Roughly speaking, one can think of discourse se s g c me u nt se 
segments as being analogous to intermediate constituents in sentence syntax. 

We can extend the set of discourse interpretation axioms used in the 
last section to establish the coherence of larger, hierarchical discourses such 
as (18.103). The recognition of discourse segments, and ultimately discourse 
structure, results as a by-product of this process. 
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First, we add axiom (18.104), which states that a sentence is a dis¬ 
course segment. Here, w is the string of words in the sentence, and e the 
event or state described by it. 

(18.104) (Vw,e)sentence(w,e) D Segment (w^e) 

Next, we add axiom (18.105), which says that two smaller segments can 
be composed into a larger one if a coherence relation can be established 
between the two. 

(18.105) (Vh 7 !,W 2 ,, 6 * 2 :^0 Segment(wi , e \) A Segment(w 2 ,^ 2 ) 

A CoherenceRel{e\,e 2 ,e) D Segment{w\W 2 ^e) 

Note that extending our axioms for longer discourses has necessitated that 
we add a third argument to the CoherenceRel predicate (e). The value of 
this variable will be a combination of the information expressed by e\ and ^2 
that represents the main assertion of the resulting segment. For our purposes 
here, we will assume that subordinating relations such as Explanation pass 
along only one argument (in this case the first, that is, the effect), whereas 
coordinating relations such as Parallel and Occasion pass a combination 
of both arguments. These arguments are shown in parentheses next to each 
relation in Figure 18.10. 

Now, to interpret a coherent text W, one must simply prove that it is a 
segment, as expressed by statement (18.106). 

(18.106) (3e)Segment(W,e) 

These two rules will derive any possible binary branching segmental struc¬ 
ture for a discourse, as long as that structure can be supported by the estab¬ 
lishment of coherence relations between the segments. Herein lies a differ¬ 
ence between computing the syntactic structure of a sentence (see Chapter 9) 
and that of a discourse. Sentence-level grammars are generally complex, en¬ 
coding many syntactic facts about how different constituents (noun phrases, 
verb phrases) can modify in each other and in what order. The ‘discourse 
grammar’ above, on the contrary, is much simpler, encoding only two rules: 
a segment rewrites to two smaller segments, and a sentence is a segment. 
Which of the possible structures is actually assigned depends on how the 
coherence of the passage is established. 

Why would we want to compute discourse structure? Several appli¬ 
cations could benefit from it. A summarization system, for instance, might 
use it to select only the central sentences in the discourse, forgoing the in¬ 
clusion of subordinate information. For instance, a system for creating brief 
summaries might only include sentences SI and S2 when applied to pas- 
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sage (18.103), since the event representations for these were propagated to 
the top level node. A system for creating more detailed summaries might 
also include S3 and S5. Similarly, an information retrieval system might 
weight information in sentences that are propagated to higher-level parts of 
the discourse structure more heavily than information in ones that are not, 
and generation systems need knowledge of discourse structure to create co¬ 
herent discourse, as described in Chapter 20. 

Discourse structure may also be useful for natural language subtasks 
such as pronoun resolution. We already know from Section 18.1 that pro¬ 
nouns display a preference for recency, that is, they have a strong tendency 
to refer locally. But now we have two possible definitions for recency: re¬ 
cent in terms of the linear order of the discourse, or recent in terms of its 
hierarchical structure. It has been claimed that the latter definition is in fact 
the correct one, although admittedly the facts are not completely clear in all 
cases. 

In this section, we have briefly described one of several possible ap¬ 
proaches to recovering discourse structure. A different approach, one typi¬ 
cally applied to dialogues, will be described in Section 19.4. 

18.4 Psycholinguistic Studies of Reference and 
Coherence 


To what extent do the techniques described in this chapter model human 
discourse comprehension? A substantial body of psycholinguistic research 
has studied this question. 

For instance, a significant amount of work has been concerned with 
the extent to which people use the preferences described in Section 18.1 to 
interpret pronouns, the results of which are often contradictory. Clark and 
Sengal (1979) studied the effects that sentence recency plays in pronoun in¬ 
terpretation using a set of reading time experiments. After receiving and tim^experi 
acknowledging a three sentence context to read, human subjects were given 
a target sentence containing a pronoun. The subjects pressed a button when 
they felt that they understood the target sentence. Clark and Sengal found 
that the reading time was significantly faster when the referent for the pro¬ 
noun was evoked from the most recent clause in the context than when it 
was evoked from two or three clauses back. On the other hand, there was no 
significant difference between referents evoked from two clauses and three 
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clauses back, leading them to claim that “the last clause processed grants the 
entities it mentions a privileged place in working memory”. 

Crawley et al. (1990) compared the grammatical role parallelism pref¬ 
erence with a grammatical role preference, in particular, a preference for ref¬ 
erents evoked from the subject position of the previous sentence over those 
evoked from object position. Unlike previous studies which conflated these 
preferences by considering only subject-to-subject reference effects, Craw¬ 
ley et al. studied pronouns in object position to see if they tended to be as¬ 
signed to the subject or object of the last sentence. They found that in two 
answering task environments - a question answering task which revealed how the hu- 
naming task man subjects interpreted the pronoun, and a referent naming task in which 
the subjects identified the referent of the pronoun directly - the human sub¬ 
jects resolved pronouns to the subject of the previous sentence more often 
than the object. 

However, Smyth (1994) criticized the adequacy of Crawley et al.’s data 
for evaluating the role of parallelism. Using data that met more stringent re¬ 
quirements for assessing parallelism, Smyth found that subjects overwhelm¬ 
ingly followed the parallelism preference in a referent naming task. The 
experiment supplied weaker support for the preference for subject referents 
over object referents, which he posited as a default strategy when the sen¬ 
tences in question are not sufficiently parallel. 

Caramazza et al. (1977) studied the effect of the ‘implicit causality’ 
of verbs on pronoun resolution. Verbs were categorized in terms of having 
coffp E LETK)N subject bias or object bias using a sentence completion task. Subjects were 
given sentence fragments such as (18.107). 

(18.107) John telephoned Bill because he 

The subjects provided completions to the sentences, which identified to the 
experimenters what referent for the pronoun they favored. Verbs for which 
a large percentage of human subjects indicated a grammatical subject or ob¬ 
ject preference were categorized as having that bias. A sentence pair was 
then constructed for each biased verb: a ‘congruent’ sentence in which the 
semantics supported the pronoun assignment suggested by the verb’s bias, 
and an ‘incongruent’ sentence in which the semantics supported the opposite 
prediction. For example, sentence (18.108) is congruent for the subject-bias 
verb ‘telephoned’, since the semantics of the second clause supports assign¬ 
ing the subject John as the antecedent of he , whereas sentence (18.109) is 
incongruent since the semantics supports assigning the object Bill. 

(18.108) John telephoned Bill because he wanted some information. 
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(18.109) John telephoned Bill because he withheld some information. 


In a referent naming task, Caramazza et al. found that naming times were 
faster for the congruent sentences than for the incongruent ones. Perhaps 
surprisingly, this was even true for cases in which the two people mentioned 
in the first clause were of different genders (e.g., change John to Sue in 
examples (18.108) and (18.109)), thus rendering the reference unambiguous. 

Gamham et al. (1996) differentiated between two hypotheses about the 
manner in which implicit causality might affect pronoun resolution: the fo¬ 
cus hypothesis, which says, as might be suggested by the Caramazza et al. 
experiments, that such verbs have a priming effect on the filler of a particu¬ 
lar grammatical role and thus contribute information that can be used at the 
point at which the pronoun is interpreted, and the integration hypothesis, in 
which this information is only used after the clause has been comprehended 
and is being integrated with the previous discourse. They attempted to de¬ 
termine which hypothesis is correct using a probing task. After sentences 
were presented to establish a context, a sentence containing a pronoun was 
presented one word at a time. At appropriate points during the presenta¬ 
tion, the name of one of the possible referents was displayed, and the subject 
asked whether that person has been mentioned in the sentence so far. Gar- 
nham et al. found that the implicit causality information bias was generally 
not available right after the pronoun was given, but was utilized later in the 
sentence. 

Matthews and Chodorow (1988) analyzed the problem of intrasenten- 
tial reference and the predictions of syntactically-based search strategies. In 
a question answering task, they found that subjects exhibited slower com¬ 
prehension times for sentences in which a pronoun antecedent occupied an 
early, syntactically deep position than for sentences in which the antecedent 
occupied a late, syntactically shallow position. This result is consistent with 
the search process used in Hobbs’s tree search algorithm. 

There has also been psycholinguistic work concerned with testing the 
principles of centering theory. In a set of reading time experiments, Gor¬ 
don et al (1993) found that reading times were slower when the current 
backward-looking center was referred to using a full noun phrase instead 
of a pronoun, even though the pronouns were ambiguous and the proper 
names were not. This effect - which they called a repeated name penalty 
- was found only for referents in subject position, suggesting that the Cy is 
preferentially realized as a subject. Brennan (1995) analyzed how choice 
of linguistic form correlates with centering principles. She ran a set of ex- 
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periments in which a human subject watched a basketball game and had to 
describe it to a second person. She found that the human subjects tended to 
refer to an entity using a full noun phrase in subject position before subse¬ 
quently pronominalizing it, even if the referent had already been introduced 
in object position. 

Psycholinguistic studies have also addressed the processes people use 
to establish discourse coherence. Some of this work has focussed on the 
question of inference control, that is, which of the potentially infinite num¬ 
ber of possible inferences are actually made during interpretation (Singer, 
1994; Garrod and Sanford, 1994). These can be categorized in terms of be¬ 
ing necessary inferences, those which are necessary to establish coherence, 
and elaborative inferences, those which are suggested by the text but not 
necessary for establishing coherence. The position that only necessary infer¬ 
ences are made during interpretation has been called the deferred inference 
theory (Gamham, 1985) and the minimalist position (McKoon and Ratcliff, 
1992). As with pronoun interpretation, results of studies testing these ques¬ 
tions have yielded potentially contradictory results. Indeed, the results in 
each case depend to a large degree on the experimental setup and paradigm 
(Keenan et al , 1990). 

Johnson et al (1973), for instance, examined this question using a 
recognition judgement task. They presented subjects with passages such 
as (18.110). 

(18.110) When the man entered the kitchen he slipped on a wet spot and 

dropped the delicate glass pitcher on the floor. The pitcher was very 

expensive, and everyone watched the event with horror. 

The subjects were subsequently presented either with a sentence taken di¬ 
rectly from one of the passages, such as the first sentence of (18.110), or 
one that included an elaborative inference in the form of an expected conse¬ 
quence such as (18.111). The subjects were then asked if the sentence had 
appeared verbatim in one of the passages. 

(18.111) The man broke the delicate glass pitcher on the floor. 

Both types of sentence received a recognition rate in the mid-60% range, 
whereas control sentences that substantially altered the meaning were rec¬ 
ognized much less often (about 22%). By running a similar experiment that 
also measured subjects’ response times, Singer (1979) addressed the ques¬ 
tion of whether these inferences were made at the time the original sentence 
was comprehended (and thus truly elaborative), or at the time that the ex¬ 
pected consequence version was presented. While Singer also found that the 
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identical and expected consequence versions yield similar rates of positive 
responses, the judgements about the consequence versions took 0.2-0.3 sec¬ 
onds longer than for the identical sentences, suggesting that the inference 
was not made at comprehension time. 

Singer (1980) examined the question of when different types of infer¬ 
ences were made using passages such as (18.112)-(18.114). 

(18.112) The dentist pulled the tooth painlessly. The patient liked the new 
method. 

(18.113) The tooth was pulled painlessly. The dentist used a new method. 

(18.114) The tooth was pulled painlessly. The patient liked the new 
method. 

Each of these passages was presented to the subject, followed by the test 
sentence given in (18.115). 

(18.115) A dentist pulled the tooth. 

The information expressed in (18.115) is mentioned explicitly in (18.112), is 
necessary to establish coherence in (18.113), and is elaborative in (18.114). 
Singer found that subject verification times were approximately the same in 
the first two cases, but 0.25 seconds slower in the elaborative case, adding 
support to the deferred inference theory. 

Kintsch and colleagues have proposed and analyzed a ‘construction- 
integration’ model of discourse comprehension (Kintsch and van Dijk, 1978; 
van Dijk and Kintsch, 1983; Kintsch, 1988). They defined the concept of a 
text macrostructure, which is a hierarchical network of propositions that 
provides an abstract, semantic description of the global content of the text. 
Guindon and Kintsch (1984) evaluated whether the elaborative inferences 
necessary to construct the macrostructure accompany comprehension pro¬ 
cesses, using a lexical priming technique. Subjects read a passage and then 
were asked if a particular word pair was present in the text. Three types of 
word pairs were used: pairs that were not mentioned in the text but were 
related to the text macrostructure, pairs of ‘distractor words’ that were the¬ 
matically related to the text but not the macrostructure, and pairs of themat¬ 
ically unrelated distractor words. The number of ‘false alarms’ - in which a 
subject erroneously indicated that the words appeared in the text - was sig¬ 
nificantly higher for macrostructure pairs than for thematically related pairs, 
which in turn was higher than for pairs of thematically unrelated words. In 
the remaining cases - in which the subjects correctly rejected word pairs that 
did not appear - response times were significantly longer for macrostructure 
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words than thematically related pairs, which in turn were higher than for 
thematically unrelated words. 

Myers etal (1987) considered the question of how the degree of causal 
relatedness between sentences affects comprehension times and recall accu¬ 
racy. Considering a target sentence such as (18.116). 

(18.116) She found herself too frightened to move. 

they designed four context sentences, shown in (18.117)-(18.120), which 
form a continuum moving from high to low causal relatedness to (18.116). 

(18.117) Rose was attacked by a man in her apartment. 

(18.118) Rose saw a shadow at the end of the hall. 

(18.119) Rose entered her apartment to find a mess. 

(18.120) Rose came back to her apartment after work. 

Subjects were presented with cause-effect sentence pairs consisting of a con¬ 
text sentence and the target sentence. Myers et al. found that reading times 
were faster for more causally related pairs. After the subjects had seen a 
cued recall number of such pairs, Myers et al. then ran a cued recall experiment, in 
which the subjects were given one sentence from a pair and asked to recall 
as much as possible about the other sentence in the pair. They found that the 
subjects recalled more content for more causally related sentence pairs. 


18.5 Summary 

In this chapter, we saw that many of the problems that natural language pro¬ 
cessing systems face operate between sentences, that is, at the discourse 
level. Here is a summary of some of the main points we discussed: 

• Discourse interpretation requires that one build an evolving represen¬ 
tation of discourse state, called a discourse model , that contains repre¬ 
sentations of the entities that have been referred to and the relationships 
in which they participate. 

• Natural languages offer many ways to refer to entities. Each form of 
reference sends its own signals to the hearer about how it should be 
processed with respect to her discourse model and set of beliefs about 
the world. 

• Pronominal reference can be used for referents that have an adequate 
degree of salience in the discourse model. There are a variety of lex- 
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ical, syntactic, semantic, and discourse factors that appear to affect 
salience. 

• These factors can be modeled and weighed against each other in a pro¬ 
noun interpretation algorithm, due to Lappin and Leass (1994), that 
achieves performance in the mid-80% range on some genres. 

• Discourses are not arbitrary collections of sentences; they must be co¬ 
herent. Collections of well-formed and individually interpretable sen¬ 
tences often form incoherent discourses when juxtaposed. 

• The process of establishing coherence, performed by applying the con¬ 
straints imposed by one or more coherence relations , often leads to the 
inference of additional information left unsaid by the speaker. The 
unsound rule of logical abduction can be used for performing such in¬ 
ference. 

• Discourses, like sentences, have hierarchical structure. Intermediate 
groups of locally coherent utterances are called discourse segments. 
Discourse structure recognition can be viewed as a by-product of dis¬ 
course interpretation. 


Bibliographical and Historical Notes 

Building on the foundations set by early systems for natural language under¬ 
standing (Woods et al. , 1972; Winograd, 1972b; Woods, 1978), much of the 
fundamental work in computational approaches to discourse was performed 
in the late 70’s. Webber’s (1978, 1983) work provided fundamental insights 
into how entities are represented in the discourse model and the ways in 
which they can license subsequent reference. Many of the examples she pro¬ 
vided continue to challenge theories of reference to this day. Grosz (1977b) 
addressed the focus of attention that conversational participants maintain as 
the discourse unfolds. She defined two levels of focus; entities relevant to 
the entire discourse were said to be in global focus, whereas entities that are 
locally in focus (i.e., most central to a particular utterance) were said to be 
in immediate focus. Sidner (1979, 1983b) described a method for tracking 
(immediate) discourse foci and their use in resolving pronouns and demon¬ 
strative noun phrases. She made a distinction between the current discourse 
focus and potential foci, which are the predecessors to the backward and 
forward looking centers of centering theory respectively. 
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The roots of the centering approach originate from papers by Joshi and 
Kuhn (1979) and Joshi and Weinstein (1981), who addressed the relation¬ 
ship between immediate focus and the inferences required to integrate the 
current utterance into the discourse model. Grosz et al (1983) integrated 
this work with the prior work of Sidner and Grosz. This led to a manuscript 
on centering which, while widely circulated since 1986, remained unpub¬ 
lished until Grosz et al (1995). A series of papers on centering based on this 
manuscript/paper were subsequently published (Kameyama, 1986; Brennan 
etal , 1987; DiEugenio, 1990; Walker etal, 1994; Di Eugenio, 1996; Strube 
and Hahn, 1996; Kehler, 1997a, inter alia) indexDi Eugenio, B. indexStrube, 
M.. A collection of more recent centering papers appears in Walker et al 
(1998). 

Researchers in the linguistics community have proposed accounts of 
the information status that referents hold in a discourse model (Chafe, 1976; 
Prince, 1981; Ariel, 1990; Prince, 1992; Gundel et al , 1993; Lambrecht, 
1994, inter alia). Prince (1992), for instance, analyzes information status 
in terms of two crosscutting dichotomies: hearer status and discourse sta¬ 
tus, and shows how these statuses correlate with the grammatical position 
of referring expressions. Gundel et al (1993), on the other hand, posits a 
unidimensional scale with six statuses (called the givenness hierarchy ), and 
correlates them with the linguistic form of referring expressions. 

Beginning with Hobbs’s (1978b) tree-search algorithm, researchers 
have pursued syntax-based methods for identifying reference robustly in nat¬ 
urally occurring text. Building on the work of Lappin and Leass (1994), 
Kennedy and Boguraev (1996) describe a similar system that does not rely 
on a full syntactic parser, but merely a mechanism for identifying noun 
phrases and labeling their grammatical roles. Both approaches use Alshawi’s 
(1987) framework for integrating salience factors. An algorithm that uses 
this framework for resolving references in a multimodal (i.e., speech and 
gesture) human-computer interface is described in Huls et al (1995). A dis¬ 
cussion of a variety of approaches to reference in operational systems can be 
found in Mitkov and Boguraev (1997). 

Recently, several researchers have pursued methods for reference res¬ 
olution based on supervised learning (Connolly et al , 1994; Aone and Ben¬ 
nett, 1995; McCarthy and Lehnert, 1995; Kehler, 1997b; Ge et al , 1998, 
inter alia). In these studies, machine learning methods such as Bayesian 
model induction, decision trees, and maximum entropy modeling were used 
to train models from corpora annotated with coreference relations. A discus¬ 
sion of some issues that arise in annotating corpora for coreference can be 
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found in Poesio and Vieira (1998). 

The MUC-6 information extraction evaluation included a common eval¬ 
uation on coreference (Sundheim, 1995a). The task included coreference 
between proper names, aliases, definite noun phrases, bare nouns, pronouns, 
and even coreference indicated by syntactic relations such predicate nomi- 
nals (“The Integra is the world's nicest looking car”) and appositives (“the 
Integra , the world's nicest looking car”). Performance was evaluated by 
calculating recall and precision statistics based on the distance between the 
equivalence classes of coreferent descriptions produced by a system and 
those in a human-annotated answer key. Five of the seven sites which partic¬ 
ipated in the evaluation achieved in the range of 51%-63% recall and 62%- 
72% precision. A similar evaluation was also included as part of MUC-7. 

Several researchers have posited sets of coherence relations that can 
hold between utterances in a discourse (Halliday and Hasan, 1976; Hobbs, 
1979a; Longacre, 1983; Mann and Thompson, 1987a; Polanyi, 1988; Hobbs, 
1990; Sanders et al , 1992, inter alia). A compendium of over 350 rela¬ 
tions that have been proposed in the literature can be found in Hovy (1990). 
The Linguistic Discourse Model (Polanyi, 1988; Scha and Polanyi, 1988) 
is a framework in which discourse syntax is more heavily emphasized; in 
this approach, a discourse parse tree is built on a clause-by-clause basis in 
direct analogy with how a sentence parse tree is built on a constituent-by- 
constituent basis. A more recent line of work has applied a version of the 
tree-adjoining grammar formalism to discourse parsing (Webber et al , 1999, 
and citations therein). In addition to determining discourse structure and 
meaning, theories of discourse coherence have been used in algorithms for 
interpreting discourse-level linguistic phenomena, including pronoun resolu¬ 
tion (Hobbs, 1979a; Kehler, 2000), verb phrase ellipsis and gapping (Priist, 
1992; Asher, 1993; Kehler, 1993, 1994a), and tense interpretation (Las- 
carides and Asher, 1993; Kehler, 1994b, 2000). An extensive investigation 
into the relationship between coherence relations and discourse connectives 
can be found in Knott and Dale (1994). 


Exercises 


18.1 Early work in syntactic theory attempted to characterize rules for 
pronominalization through purely syntactic means. A rule was proposed in 
which a pronoun was interpreted by deleting it from the syntactic structure 
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of the sentence that contains it, and replacing it with the syntactic represen¬ 
tation of the antecedent noun phrase. 

Explain why the following sentences (called “Bach-Peters” sentences) 
are problematic for such an analysis. 

(18.121) The man who deserves it gets the prize he wants. 

(18.122) The pilot who shot at it hit the MIG that chased him. 

What other types of reference discussed on pages 667-672 are problematic 
for this type of analysis? 

Now, consider the following example (Karttunen, 1969). 

(18.123) The student who revised his paper did better than the student who 
handed it in as is. 

What is the preferred reading for the pronoun it, and why is it different and 
interesting? Describe why the syntactic account described above can be seen 
to predict this reading. Is this type of reading common? Construct some 
superficially similar examples that nonetheless appear not to have a similar 
reading. 

18.2 Webber (1978) offers examples in which the same referent appears to 
support either singular or plural agreement: 

(18.124) John gave Mary five dollars. It was more than he gave Sue. 

(18.125) John gave Mary five dollars. One of them was counterfeit. 

What might account for this? Describe how representations of referents like 
five dollars in the discourse model could be made to allow such behavior. 

Next, consider the following examples (from Webber and Baldwin 
(1992)): 

(18.126) John made a handbag from an inner tube. 

a. He sold it for twenty dollars. 

b. He had taken it from his brother’s car. 

c. Neither of them was particularly useful. 

d. * He sold them for fifty dollars. 

Why is plural reference to the handbag and the inner tube possible in sen¬ 
tence (18.126c), but not (18.126d)? Again, discuss how representations in 
the discourse model could be made to support this behavior. 

18.3 Draw syntactic trees for example (18.68) on page 681 and apply Hobbs’s 
tree search algorithm to it, showing each step in the search. 
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18.4 Recall that Hobbs’s algorithm does not have an explicit representa¬ 
tion of a discourse model, salience, or preferences. Discuss which of the 
preferences we have described are approximated by the search process over 
syntactic representations as Hobbs has defined it, and how. 

18.5 Hobbs (1977) cites the following examples from his corpus as being 
problematic for his tree-search algorithm. 

(18.127) The positions of pillars in one hall were marked by river boulders 
and a shaped convex cushion of bronze that had served as their 
footings. 

(18.128) They were at once assigned an important place among the scanty 
remains which record the physical developments of the human race 
from the time of its first appearance in Asia. 

(18.129) Sites at which the coarse grey pottery of the Shang period has 
been discovered do not extend far beyond the southernmost reach of 
the Yellow river, or westward beyond its junction with the Wei. 

(18.130) The thin, hard, black-burnished pottery, made in shapes of angular 
profile, which archeologists consider as the clearest hallmark of the 
Lung Shan culture, developed in the east. The site from which it takes 
its name is in Shantung. It is traced to the north-east as far as 
Liao-ning province. 

(18.131) He had the duty of performing the national sacrifices to heaven 
and earth: his role as source of honours and material rewards for 
services rendered by feudal lords and ministers is commemorated in 
thousands of inscriptions made by the recipients on bronze vessels 
which were eventually deposited in their graves. 

In each case, identify the correct referent of the underlined pronoun and the 
one that the algorithm will incorrectly identify. Discuss any factors that come 
into play in determining the correct referent in each case, and what types of 
information might be necessary to account for them. 

18.6 Consider the following passage, from Brennan et al (1987): 

(18.132) Brennan drives an Alfa Romeo. 

She drives too fast. 

Friedman races her on weekends. 

She goes to Laguna Seca. 

Identify the referent that the BFP algorithm finds for the pronoun in the final 
clause. Do you agree with this choice, or do you find the example ambigu- 
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ous? Discuss why introducing a new noun phrase in subject position, with 
a pronominalized reference in object position, might lead to an ambiguity. 
What preferences are competing here? 

18.7 The approaches to pronoun resolution discussed in this chapter de¬ 
pend on accurate parsing: Hobbs’s tree search algorithm assumes a full 
syntactic tree, and Lappin and Leass’s algorithm and centering requires that 
grammatical roles are assigned correctly. Given the current state of the art 
in syntactic processing, highly accurate syntactic structures are currently not 
reliably computable. Therefore, real-world algorithms must choose between 
one of two options: (i) use a parser to generate (often inaccurate) syntactic 
analyses and use them as such, or (ii) to eschew full syntactic analysis al¬ 
together and base the algorithm on partial syntactic analysis, such as noun 
phrase recognition. The Lappin and Leass system took the first option, us¬ 
ing a highly developed parser. However, one could take the second option, 
and augment their algorithm so that surface position is used to approximate 
a grammatical role hierarchy. 

Design a set of preferences for the Lappin and Leass method that as¬ 
sumes that only noun phrases are bracketed in the input. Construct six exam¬ 
ples: (i) two that are handled by both methods, (ii) two examples that Lappin 
and Leass handle but that are not handled by your adaptation, and (iii) two 
that are not handled correctly by either algorithm. Make sure the examples 
are nontrivially different. 

18.8 Consider passages (18.133a-b), adapted from Winograd (1972b). 

(18.133) The city council denied the demonstrators a permit because 

a. they feared violence. 

b. they advocated violence. 

What are the correct interpretations for the pronouns in each case? Sketch 
out an analysis of each in the interpretation as abduction framework, in 
which these reference assignments are made as a by-product of establish¬ 
ing the Explanation relation. 

18.9 Coherence relations may also apply temporal constraints to the events 
or states denoted by sentences in a discourse. These constraints must be com¬ 
patible with the temporal information indicated by the tenses used. Consider 
the two follow-on sentences in example (18.134). 

(18.134) John got in a car accident. 

a. He drank a six-pack of beer. 
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b. He had drunk a six-pack of beer. 

In what order do the events occur in each case? What coherence relation is 
operative in each case? Discuss what might account for this difference given 
the fact that causes precede effects. 

18.10 The coherence relations Result and Explanation are highly related, 
in that Explanation is essentially the same as Result except with the oppo¬ 
site ordering of clauses. These two relations are exemplified in examples 

(18.135) and (18.136). 

(18.135) Bill was drunk. John hid his car keys. 

(18.136) John hid Bill’s car keys. He was drunk. 

Now consider the following examples: 

(18.137) Bill was drunk, but John didn’t hide his car keys. 

(18.138) John hid Bill’s car keys, even though he wasn’t drunk. 

The coherence relations underlying examples (18.137) and (18.138) have 
been called Violated Expectation and Denial of Preventer respectively. 

Define the constraints that these two relations impose, using those for 
Result and Explanation as a guide. Discuss how we might consider all four 
relations to be parameterized versions of a single relation. 

18.11 Select an editorial column from your favorite newspaper, and deter¬ 
mine the discourse structure for a 10-20 sentence portion. What problems 
did you encounter? Were you helped by superficial cues the speaker included 
(e.g., discourse connectives) in any places? 
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C: I want you to tell me the names of the fellows on the St. 
Louis team. 

A: I’m telling you. Who’s on first, What’s on second, I Don’t 
Know is on third. 

C: You know the fellows’ names? 

A: Yes. 

C: Well, then, who’s playing first? 

A: Yes. 

C: I mean the fellow’s name on first. 

A: Who. 

C: The guy on first base. 

A: Who is on first. 

C: Well what are you askin’ me for? 

A: I’m not asking you - I’m telling you. Who is on first. 

Who ’.v on First - Bud Abbott and Lou Costello’s version 
of an old burlesque standard. 


The literature of the fantastic abounds in inanimate objects magically 
endowed with sentience and the gift of speech. From Ovid’s statue of Pyg¬ 
malion to Mary Shelley’s Frankenstein, Cao Xue Qin’s Divine Lumines¬ 
cent Stone-in-Waiting in the Court of Sunset Glow to Snow White’s mirror, 
there is something deeply touching about creating something and then hav¬ 
ing a chat with it. Legend has it that after finishing his sculpture of Moses, 
Michelangelo thought it so lifelike that he tapped it on the knee and com¬ 
manded it to speak. Perhaps this shouldn’t be surprising. Language itself 
has always been the mark of humanity and sentience, and conversation or tion^^ 
dialogue is the most fundamental and specially privileged arena of language, dialogue 
It is certainly the first kind of language we learn as children, and for most of 
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us, it is the kind of language we most commonly indulge in, whether we are 
ordering curry for lunch or buying postage stamps, participating in business 
meetings or talking with our families, booking airline flights or complaining 
about the weather. 

This chapter introduces the fundamental structures and algorithms in 
conversational agents, programs which communicate with users in natural 
language in order to book airline flights, answer questions, or act as a tele¬ 
phone interface to email. Many of these issues are also relevant for business 
meeting summarization systems and other spoken language understanding 
systems which must transcribe and summarize structured conversations like 
meetings. Section 19.1 begins by introducing some issues that make con¬ 
versation different from other kinds of discourse, introducing the important 
ideas of turn-taking, grounding, and implicature. Section 19.2 introduces 
the speech act or dialogue act, and Section 19.3 gives two different algo¬ 
rithms for automatic speech act interpretation. Section 19.4 describes how 
structure and coherence in dialogue differ from the discourse structure and 
coherence we saw in Chapter 18. Finally, Section 19.5 shows how each of 
these issues must be addressed in choosing an architecture for a dialogue 
manager as part of a conversational agent. 


19.1 What Makes Dialogue Different? 


Much about dialogue is similar to other kinds of discourse like the text mono¬ 
logues of Chapter 18. Dialogues exhibit anaphora and discourse structure 
and coherence, although with some slight changes from monologue. For ex¬ 
ample when resolving an anaphor in dialogue it’s important to look at what 
the other speaker said. In the following fragment from the air travel conver¬ 
sation in Figure 19.1 (to be discussed below), realizing that the pronoun they 
refers to non-stop flights in C’s utterance requires looking at A’s previous 
utterance. 

A 4 : Right. There’s three non-stops today. 

C 5 : What are they? 

Dialogue does differ from written monologue in deeper ways, however. 
The next few subsections highlight some of these differences. 
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T\irns and Utterances 

One difference between monologue and dialogue is that dialogue is char¬ 
acterized by turn-taking. Speaker A says something, then speaker B, then turn taking 
speaker A, and so on. Figure 19.1 shows a sample dialogue broken up into 
labeled turns; we’ve chosen this human-human dialogue because it concerns 
travel planning, a domain that is the focus of much recent human-machine 
dialogue research. 


Ci: ... I need to travel in May. 

Ai: And, what day in May did you want to travel? 

C 2 : OK uh I need to be there for a meeting that’s from the 12 th to the 
15th. 

A 2 : And you’re flying into what city? 

C 3 : Seattle. 

A 3 : And what time would you like to leave Pittsburgh? 

C 4 : Uh hmm I don’t think there’s many options for non-stop. 

A 4 : Right. There’s three non-stops today. 

C 5 : What are they? 

A 5 : The first one departs PGH at 10:00am arrives Seattle at 12:05 their 
time. The second flight departs PGH at 5:55pm, arrives Seattle at 
8 pm. And the last flight departs PGH at 8:15pm arrives Seattle at 
10:28pm. 

Ce: OK I’ll take the 5ish flight on the night before on the 1 1 th. 

A&: On the 11th? OK. Departing at 5:55pm arrives Seattle at 8 pm, US 
Air flight 115. 

C 7 : OK. _ 

Figure 19.1 A fragment from a telephone conversation between a speech 
recognition researcher client (C) and a travel agent (A). 


How do speakers know when is the proper time to contribute their turn? 
Consider the timing of the utterances in conversations like Figure 19.1. First, 
notice that this dialogue has no noticeable overlap. That is, the beginning of 
each speakers turn follows the end of the previous speaker’s turn (overlap 
would have been indicated by surrounding it with the # symbol). The actual 
amount of overlapped speech in American English conversation seems to be 
quite small; Levinson (1983) suggests the amount is less than 5% in gen¬ 
eral, and probably less for certain kinds of dialogue like the task-oriented 
dialogue in Figure 19.1. If speakers aren’t overlapping, perhaps they are 
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CONVERSA¬ 

TION 

ANALYSIS 


waiting a while after the other speaker? This is also very rare. The amount 
of time between turns is quite small, generally less than a few hundred mil¬ 
liseconds.even in multi-party discourse. In fact, it may take more than this 
few hundred milliseconds for the next speaker to plan the motor routines for 
producing their utterance, which means that speakers begin motor planning 
for their next utterance before the previous speaker has finished. For this to 
be possible, natural conversation must be set up in such a way that (most 
of the time) people can quickly figure out who should talk next, and ex¬ 
actly when they should talk. This kind of turn-taking behavior is generally 
studied in the field of Conversation Analysis (CA). In a key conversation- 
analytic paper, Sacks et al (1974) argued that turn-taking behavior, at least 
in American English, is governed by a set of turn-taking rules. These rules 
apply at a transition-relevance place, or TRP; places where the structure 
of the language allows speaker shift to occur. Here is a simplified version of 
the turn-taking rules, grouped into a single three-part rule; see Sacks et al 
(1974) for the complete rules: 

(19.1) TUm-taking Rule. At each TRP of each turn: 

a. If during this turn the current speaker has selected A as the next 
speaker then A must speak next. 

b. If the current speaker does not select the next speaker, any other 
speaker may take the next turn. 

c. If no one else takes the next turn, the current speaker may take 
the next turn. 

There are a number of important implications of rule (19.1) for di¬ 
alogue modeling. First, subrule (19.1a) implies that there are some utter¬ 
ances by which the speaker specifically selects who the next speaker will 
be. The most obvious of these are questions, in which the speaker selects 
another speaker to answer the question. Two-part structures like question- 
SB^ answer are called adjacency pairs (Schegloff, 1968); other adjacency 
pairs include greeting followed by greeting, compliment followed 
by downplayer, request followed by grant. We will see that these 
pairs and the dialogue expectations they set up will play an important role in 
dialogue modeling. 

Subrule (19.1a) also has an implication for the interpretation of silence. 
While silence can occur after any turn, silence which follows the first part of 
||“ T an adjacency pair-part is significant silence. For example (Levinson, 1983) 
notes the following example from Atkinson and Drew (1979); pause lengths 
are marked in parentheses (in seconds): 
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(19.2) A: 

Is there something bothering you or not? 


(1.0) 

A: 

Yes or no? 


(1.5) 

A: 

Eh? 

B: 

No. 


Since A has just asked B a question, the silence is interpreted as a 
refusal to respond, or perhaps a dispreferred response (a response, like say- ^ S R P R R E E D 
ing ‘no’ to a request, which is stigmatized). By contrast, silence in other 
places, for example a lapse after a speaker finishes a turn, is not generally 
interpretable in this way. These facts are relevant for user interface design 
in spoken dialogue systems; users are distributed by the pauses in dialogue 
systems caused by slow speech recognizers (Yankelovich et al , 1995). 

Another implication of (19.1) is that transitions between speakers don’t 
occur just anywhere; the transition-relevance places where they tend to oc¬ 
cur are generally at utterance boundaries. This brings us to the next differ- utterance 
ence between spoken dialogue and textual monologue (of course dialogue 
can be written and monologue spoken; but most current applications of di¬ 
alogue involve speech): the spoken utterance versus the written sentence. 

Recall from Chapter 9 that utterances differ from written sentences in a num¬ 
ber of ways. They tend to be shorter, are more likely to be single clauses, the 
subjects are usually pronouns rather than full lexical noun phrases, and they 
include filled pauses, repairs, and restarts. 

One very important difference not discussed in Chapter 9 is that while 
written sentences and paragraphs are relatively easy to automatically seg¬ 
ment from each other, utterances and turns are quite complex to segment. 

Utterance boundary detection is important since many computational dia¬ 
logue models are based on extracting an utterance as a primitive unit. The 
segmentation problem is difficult because a single utterance may be spread 
over several turns, or a single turn may include several utterances. For ex¬ 
ample in the following fragment of a dialogue between a travel agent and a 
client, the agent’s utterance stretches over three turns: 


(19.3) 


A: Yeah yeah the um let me see here we’ve got you on American 
flight nine thirty eight 
C: Yep. 

A: leaving on the twentieth of June out of Orange County John 
Wayne Airport at seven thirty p.m. 

C: Seven thirty. 

A: and into uh San Francisco at eight fifty seven. 
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By contrast, the example below has three utterances in one turn: 

(19.4) A: Three two three and seven five one. OK and then does he 
know there is a nonstop that goes from Dulles to San Fran¬ 
cisco? Instead of connection through St. Louis. 

Algorithms for utterance segmentation are based on many boundary 
cues such as: 

cue words • cue words: Cue (or ‘clue’) words like well , and , so , etc., tend to occur 
at the beginnings and ends of utterances (Reichman, 1985; Hirschberg 
andLitman, 1993). 

• A-gram word sequences: Specific word sequences often indicate bound¬ 
aries. A-gram grammars can be trained on a training set labeled with 
special utterance-boundary tags, and then HMM decoding techniques 
can be used to find the most likely utterance boundaries in a unlabeled 
test set (Mast et al , 1996; Meteer and Iyer, 1996; Stolcke and Shriberg, 
1996a). 

• prosody: Prosodic features like pitch, accent, phrase-final lengthening 
and pause duration play a role in utterance/tum segmentation, as dis¬ 
cussed in Chapter 4, although the relationship between utterances and 
prosodic units like the intonation unit (Du Bois et al , 1983) or into- 

® national phrase (Beckman and Pierrehumbert, 1986)) is complicated 

(Ladd, 1996; Ford and Thompson, 1996; Ford et al , 1996, inter alia) 
indexFord, C.. 

The relationship between turns and utterances seems to be more one- 
to-one in human-machine dialogue than the human-human dialogues dis¬ 
cussed above. Probably this is because the simplicity of current systems 
causes people to use simpler utterances and turns. Thus while computational 
tasks like meeting summarization require solving quite difficult segmenta¬ 
tion problems, segmentation may be easier for conversational agents. 


Grounding 


COMMON 

GROUND 


GROUND 

ACKNOWL¬ 

EDGE 


Another important characteristic of dialogue that distinguishes it from mono¬ 
logue is that it is a collective act performed by the speaker and the hearer. 
One implication of this collectiveness is that, unlike in monologue, the speaker 
and hearer must constantly establish common ground (Stalnaker, 1978), the 
set of things that are mutually believed by both speakers. The need to achieve 
common ground means that the hearer must ground or acknowledge the 
speaker’s utterances, or else make it clear that the there was a problem in 
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reaching common ground. For example, consider the role of the word mm- 
hmm in the following fragment of a conversation between a travel agent and 
a client: 

A: ... returning on US flight one one one eight. 

C: Mm hmm 

The word mm-hmm here is a continuer, also often called a backchan- 
nel or an acknowledgement token. A continuer is a short utterance which 
acknowledges the previous utterance in some way, often cueing the other 
speaker to continue talking (Jefferson, 1984; Schegloff, 1982; Yngve, 1970). 
By letting the speaker know that the utterance has ‘reached’ the addressee, 
a continuer/backchannel thus helps the speaker and hearer achieve common 
ground. Continuers are just one of the ways that the hearer can indicate 
that she believes she understands what the speaker meant. Clark and Schae¬ 
fer (1989) discuss five main types of methods, ordered from weakest to 
strongest: 

1. Continued attention: B shows she is continuing to attend and there¬ 
fore remains satisfied with A’s presentation. 

2. Relevant next contribution: B starts in on the next relevant contribu¬ 
tion. 

3. Acknowledgement: B nods or says a continuer like uh-huh , yeah , or 
the like, or an assessment like that's great. 

4. Demonstration: B demonstrates all or part of what she has under¬ 
stood A to mean, for example by paraphrasing or reformulating A’s 
utterance, or by collaboratively completing A’s utterance. 

5. Display: B displays verbatim all or part of A’s presentation. 

The following excerpt from our sample conversation shows a display 
of understanding by A’s repetition of on the 11th: 

Ce: OK I’ll take the 5ish flight on the night before on the 11th. 

As: On the 11th? 

Such repeats or reformulations are often done in the form of questions 
like As; we return to this issue on page 735. 

Not all of Clark and Shaefer’s methods are available for telephone- 
based conversational agents. Without eye-gaze as an visual indicator of at¬ 
tention, for example, continued attention isn’t an option. In fact Stifelman 
et al (1993) and (Yankelovich et al , 1995) point out that users of speech- 
based interfaces are often confused when the system doesn’t give them an 
explicit acknowledgement signal after processing the user’s utterances. 


CONTINUER 

BACKCHAN- 

NEL 
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REQUEST 
FOR REPAIR 


IMPLICATURE 


MAXIMS 


In addition to these acknowledgement acts, a hearer can indicate that 
there were problems in understanding the previous utterance, for example by 
issuing a request for repair like the following Switchboard example: 

A: Why is that? 

B: Huh? 

A: Why is that? 

Conversational Implicature 

The final important property of conversation is the way the interpretation of 
an utterance relies on more than just the literal meaning of the sentences. 
Consider the client’s response C 2 from the sample conversation above, re¬ 
peated here: 

Ai: And, what day in May did you want to travel? 

C 2 : OK uh I need to be there for a meeting that’s from the 12th to the 15th. 

Notice that the client does not in fact answer the question. The client 
merely states that he has a meeting at a certain time. The semantics for this 
sentence produced by a semantic interpreter will simply mention this meet¬ 
ing. What is it that licenses the agent to infer that the client is mentioning 
this meeting so as to inform the agent of the travel dates? 

Now consider another utterance from the sample conversation, this one 
by the agent: 

A 4 : ... There’s three non-stops today. 

Now this statement would still be true if there were seven non-stops 
today, since if there are seven of something, there are by definition also three. 
But what the agent means here is that there are three and not more than 
three non-stops today. How is the client to infer that the agent means only 
three non-stops? 

These two cases have something in common; in both cases the speaker 
seems to expect the hearer to draw certain inferences; in other words, the 
speaker is communicating more information than seems to be present in the 
uttered words. These kind of examples were pointed out by Grice (1975, 
1978) as part of his theory of conversational implicature. Implicature 
means a particular class of licensed inferences. Grice proposed that what en¬ 
ables hearers to draw these inferences is that conversation is guided by a set 
of maxims, general heuristics which play a guiding role in the interpretation 
of conversational utterances. He proposed the following four maxims: 
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• Maxim of Quantity: Be exactly as informative as is required: 

1. Make your contribution as informative as is required (for the cur¬ 
rent purposes of the exchange). 

2. Do not make your contribution more informative than is required. 

• Maxim of Quality: Try to make your contribution one that is true: 

1. Do not say what you believe to be false. 

2. Do not say that for which you lack adequate evidence. 

• Maxim of Relevance: Be relevant. 

• Maxim of Manner: Be perspicuous: 

1. Avoid obscurity of expression. 

2. Avoid ambiguity. 

3. Be brief (avoid unnecessary prolixity). 

4. Be orderly. 

It is the Maxim of Quantity (specifically Quantity 1) that allows the 
hearer to know that three non-stops didn’t mean seven non-stops. This is 
because the hearer assumes the speaker is following the maxims, and thus 
if the speaker meant seven non-stops she would have said seven non-stops 
(‘as informative as is required’). The Maxim of Relevance is what allows the 
agent to know that the client wants to travel by the 12th. The agent assumes 
the client is following the maxims, and hence would only have mentioned 
the meeting if it was relevant at this point in the dialogue. The most natural 
inference that would make the meeting relevant is the inference that the client 
meant the agent to understand that his departure time was before the meeting 
time. 

These three properties of conversation (turn-taking, grounding, and 
implicature) will play an important role in the discussion of dialogue acts, 
dialogue structure, and dialogue managers in the next sections. 

19.2 Dialogue Acts 


An important insight about conversation, due to Austin (1962), is that an 
utterance in a dialogue is a kind of action being performed by the speaker. 
This is particularly clear in performative sentences like the following: 

(19.5) I name this ship the Titanic. 

(19.6) I second that motion. 


QUANTITY 


QUALITY 

RELEVANCE 

MANNER 


PERFORMA¬ 

TIVE 
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(19.7) I bet you five dollars it will snow tomorrow. 

When uttered by the proper authority, for example, (19.5) has the effect of 
changing the state of the world (causing the ship to have the name Titanic) 
just as any action can change the state of the world. Verbs like name or 
second which perform this kind of action are called performative verbs, and 
speech acts Austin called these kinds of actions speech acts. What makes Austin’s work 
so far-reaching is that speech acts are not confined to this small class of 
performative verbs. Austin’s claim is that the utterance of any sentence in a 
real speech situation constitutes three kinds of acts: 

• locutionary act: the utterance of a sentence with a particular meaning 

• illocutionary act: the act of asking, answering, promising, etc., in 
uttering a sentence. 

• perlocutionary act: the (often intentional) production of certain ef¬ 
fects upon the feelings, thoughts, or actions of the addressee in uttering 
a sentence. 

For example, Austin explains that the utterance of (19.8) might have the 
K TI0N - illocutionary force of protesting and the perlocutionary effect of stopping 
the addressee from doing something, or annoying the addressee. 

(19.8) You can’t do that. 

The term speech act is generally used to describe illocutionary acts 
rather than either of the other two levels. Searle (1975b), in modifying a 
taxonomy of Austin’s, suggests that all speech acts can be classified into one 
of 5 major classes: 

• Assertives: committing the speaker to something’s being the case ( sug¬ 
gesting , putting forward , swearing , boasting , concluding). 

• Directives: attempts by the speaker to get the addressee to do some¬ 
thing ( asking , ordering , requesting , inviting , advising , begging). 

• Commissives: committing the speaker to some future course of action 
(promising, planning, vowing, betting, opposing). 

• Expressives: expressing the psychological state of the speaker about a 
state of affairs thanking , apologizing , welcoming , deploring. 

• Declarations: bringing about a different state of the world via the ut¬ 
terance (including many of the performative examples above; I resign , 
You're fired.) 
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While speech acts provide a useful characterization of one kind of 
pragmatic force, more recent work, especially in building dialogue systems, 
has significantly expanded this core notion, modeling more kinds of con¬ 
versational functions that an utterance can play. The resulting enriched acts 
are called dialogue acts (?) or conversational moves (Power, 1979; Carletta 
et ah, 1997). A recent ongoing effort to develop dialogue act tagging scheme 
is the DAMSL (Dialogue Act Markup in Several Layers) architecture (Allen 
and Core, 1997; Walker et ah , 1996; Carletta et ah , 1997; Core et ah , 1999), 
which codes various levels of dialogue information about utterances. Two 
of these levels, the forward looking function and the backward looking 
function, are extensions of speech acts which draw on notions of dialogue 
structure like the adjacency pairs mentioned earlier as well as notions of 
grounding and repair. For example, the forward looking function of an utter¬ 
ance corresponds to something like the Searle/Austin speech act, although 
the DAMSL tag set is hierarchical, and is focused somewhat on the kind of 
dialogue acts that tend to occur in task-oriented dialogue: 


STATEMENT 

a claim made by the speaker 

INFO-REQUEST 

a question by the speaker 

CHECK 

a question for confirming information 
(see below) 

INFLUENCE-ON-ADDRESSEE 

(=Searle’s directives) 

OPEN-OPTION 

a weak suggestion or listing of options 

ACTION-DIRECTIVE 

an actual command 

INFLUENCE-ON-SPEAKER 

(^Austin’s commissives) 

OFFER 

speaker offers to do something, 

(subject to confirmation) 

COMMIT 

speaker is committed to doing something 

CONVENTIONAL 

other 

OPENING 

greetings 

CLOSING 

farewells 

THANKING 

thanking and responding to thanks 


The backward looking function of DAMSL focuses on the relationship 
of an utterance to previous utterances by the other speaker. These include 
accepting and rejecting proposals (since DAMSL is focused on task-oriented 
dialogue), as well as grounding and repair acts discussed above. 


DIALOGUE 

ACTS 

MOVES 
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AGREEMENT 

ACCEPT 

ACCEPT-PART 

MAYBE 

REJECT-PART 

REJECT 

HOLD 

ANSWER 

UNDERSTANDING 

SIGNAL-NON-UNDER. 

SIGNAL-UNDER. 

ACK 

REPEAT-REPHRASE 

COMPLETION 


speaker’s response to previous proposal 
accepting the proposal 
accepting some part of the proposal 
neither accepting nor rejecting the proposal 
rejecting some part of the proposal 
rejecting the proposal 

putting off response, usually via subdialogue 
answering a question 
whether speaker understood previous 
speaker didn’t understand (usually = NTRI) 
speaker did understand 
demonstrated via continuer or assessment 
demonstrated via repetition or reformulation 
demonstrated via collaborative completion 


Figure 19.2 shows a labeling of our sample conversation using versions 
of the DAMSL Forward and Backward tags. 


19.3 Automatic Interpretation of Dialogue Acts 


The previous section introduced dialogue acts and other activities that ut¬ 
terances can perform. This section turns to the problem of identifying or 
interpreting these acts. That is, how do we decide whether a given input is a 
QUESTION, a STATEMENT, a SUGGEST (directive), or an ACKNOWL¬ 
EDGEMENT? 

At first glance, this problem looks simple. We saw in Chapter 9 that 
yes-no-questions in English have aux-inversion, statements have declarative 
syntax (no aux-inversion), and commands have imperative syntax (sentences 
with no syntactic subject), as in (19.9): 

(19.9) YES-NO-QUESTION Will breakfast be served on USAir 1557? 
STATEMENT I don’t care about lunch 

COMMAND Show me flights from Milwaukee to Or¬ 

lando on Thursday night. 

It seems from (19.9) that the surface syntax of the input ought to tell us what 
illocutionary act it is. Alas, as is clear from Abbot and Costello’s famous 
Who's on First routine at the beginning of the chapter, things are not so sim¬ 
ple. The mapping between surface form and illocutionary act is not obvious 
or even one-to-one. 
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[assert] 

Ci: ... I need to travel in May. 

[info- 

Ai: And, what day in May did you want to travel? 

req,ack] 


[assert, 

C 2 : OK uh I need to be there for a meeting that’s from the 

answer] 

12th to the 15th. 

[info- 

A 2 : And you’re flying into what city? 

req,ack] 


[assert,answer] C 3 : Seattle. 

[info- 

A 3 : And what time would you like to leave Pittsburgh? 

req,ack] 


[check,hold] 

C 4 : Uh hmm I don’t think there’s many options for non- 


stop. 

[accept,ack] 

A 4 : Right. 

[assert] 

There’s three non-stops today. 

[info-req] 

C 5 : What are they? 

[assert, 

A 5 : The first one departs PGH at 10:00am arrives Seattle 

open-option] 

at 12:05 their time. The second flight departs PGH 


at 5:55pm, arrives Seattle at 8 pm. And the last flight 


departs PGH at 8:15pm arrives Seattle at 10:28pm. 

[accept,ack] 

Ce,: OK I’ll take the 5ish flight on the night before on the 


llth. 

[check,ack] 

A, 6 : On the llth? 

[assert,ack] 

OK. Departing at 5:55pm arrives Seattle at 8 pm, US 


Air flight 115. 

[ack] 

C 7 : OK. 

Figure 19.2 
Figure 19.1. 

A potential DAMSL labeling of the conversation fragment in 


For example, the following utterance spoken to an ATIS system looks 
like a YES-NO-QUESTION meaning something like Are you capable of 
giving me a list of...?: 

(19.10) Can you give me a list of the flights from Atlanta to Boston? 

In fact, however, this person was not interested in whether the system 
was capable of giving a list; this utterance was actually a polite form of a 
DIRECTIVE or a REQUEST, meaning something more like Please give me 
a list of.... Thus what looks on the surface like a QUESTION can really be 
a REQUEST. 

Similarly, what looks on the surface like a STATEMENT can really be 
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a QUESTION. A very common kind of question, called a CHECK question 
(Carletta et al , 1997; Labov and Fanshel, 1977), is used to ask 

the other participant to confirm something that this other participant 
has privileged knowledge about. These CHECKs are questions, but they 
have declarative surface form, as the boldfaced utterance in the following 
snippet from another travel agent conversation: 

A OPEN-OPTION I was wanting to make some arrangements for 
a trip that I’m going to be taking uh to LA uh 
beginning of the week after next. 

B HOLD OK uh let me pull up your profile and I’ll be 

right with you here, [pause] 

B CHECK And you said you wanted to travel next week? 

A ACCEPT Uh yes. 


Utterances which use a surface statement to ask a question, or a surface 
speechacts question to issue a request, are called indirect speech acts.How can a surface 
yes-no-question like Can you give me a list of the flights from Atlanta to 
Boston? be mapped into the correct illocutionary act REQUEST. Solutions 
to this problem lie along a continuum of idiomaticity. At one end of the 
continuum is the idiom approach, which assumes that a sentence structure 
like Can you give me a list? or Can you pass the salt? is ambiguous between 
a literal meaning as a YES-NO-QUESTION and an idiomatic meaning as 
a request. The grammar of English would simply list REQUEST as one 
meaning of Can you X. One problem with this approach is that there are 
many ways to make an indirect request, each of which has slightly different 
surface grammatical structure (see below). The grammar would have to store 
the REQUEST meaning in many different places. Furthermore, the idiom 
approach doesn’t make use of the fact that there are semantic generalizations 
about what makes something a legitimate indirect request. 

The alternative end of the continuum is the inferential approach, first 
proposed by Gordon and Lakoff (1971) and taken up by Searle (1975a). 
Their intuition was that a sentence like Can you give me a list of flights from 
Atlanta? is unambiguous, meaning only Do you have the ability to give me 
a list of flights from Atlanta?. The directive speech act Please give me a list 
inferred of flights from Atlanta is inferred by the hearer. 

The next two sections will introduce two models of dialogue act in¬ 
terpretation: an inferential model called the plan inference model, and an 
idiom-based model called the cue model. 
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Plan-Inferential Interpretation of Dialogue Acts 

The plan-inference approach to dialogue act interpretation was first proposed 
by Gordon and Lakoff (1971) and Searle (1975a) when they noticed that 
there was a structure to what kind of things a speaker could do to make an 
indirect request. In particular, they noticed that a speaker could mention or 
question various quite specific properties of the desired activity to make an 
indirect request; here is a partial list with examples from the ATIS corpus: 

1. The speaker can question the hearer’s ability to perform the activity 

• Can you give me a list of the flights from Atlanta to Boston? 

• Could you tell me if Delta has a hub in Boston? 

• Would you be able to, uh, put me on a flight with Delta? 

2. The speaker can mention speaker’s wish or desire about the activity 

• I want to fly from Boston to San Francisco. 

• I would like to stop somewhere else in between. 

• I’m looking for one way flights from Tampa to Saint Louis. 

• I need that for Tuesday. 

• I wonder if there are any flights from Boston to Dallas. 

3. The speaker can mention the hearer’s doing the action 

• Would you please repeat that information? 

• Will you tell me the departure time and arrival time on this Amer¬ 
ican flight? 

4. The speaker can question the speaker’s having permission to receive 
results of the action 

• May I get a lunch on flight U A two one instead of breakfast? 

• Could I have a listing of flights leaving Boston? 

Based on this realization, Searle (1975a, p. 73) proposed that the hearer’s 
chain of reasoning upon hearing Can you give me a list of the flights from 
Atlanta to Boston? might be something like the following (modified for our 
ATIS example): 

1. X has asked me a question about whether I have the ability to give a 
list of flights. 

2. I assume that X is being cooperative in the conversation (in the Gricean 
sense) and that his utterance therefore has some aim. 

3. X knows I have the ability to give such a list, and there is no alternative 
reason why X should have a purely theoretical interest in my list-giving 
ability. 
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4. Therefore X’s utterance probably has some ulterior illocutionary point. 
What can it be? 

5. A preparatory condition for a directive is that the hearer have the ability 
to perform the directed action. 

6. Therefore X has asked me a question about my preparedness for the 
action of giving X a list of flights. 

7. Furthermore, X and I are in a conversational situation in which giving 
lists of flights is a common and expected activity. 

8. Therefore, in the absence of any other plausible illocutionary act, X is 
probably requesting me to give him a list of flights. 

The inferential approach has a number of advantages. First, it explains 
why Can you give me a list of flights from Boston? is a reasonable way of 
making an indirect request and Boston is in New England is not: the former 
mentions a precondition for the desired activity, and there is a reasonable in¬ 
ferential chain from the precondition to the activity itself. The inferential ap¬ 
proach has been modeled by Allen, Cohen, and Perrault and their colleagues 
bdi in a number of influential papers on what have been called BDI (belief, de¬ 

sire, and intention) models (Allen, 1995). The earliest papers, such as Cohen 
and Perrault (1979), offered an AI planning model for how speech acts are 
generated. One agent, seeking to find out some information, could use stan¬ 
dard planning techniques to come up with the plan of asking the hearer to 
tell the speaker the information. Perrault and Allen (1980) and Allen and 
Perrault (1980) also applied this BDI approach to comprehension , specifi¬ 
cally the comprehension of indirect speech effects, essentially cashing out 
Searle’s (1975) promissory note in a computational formalism. 

WeTl begin by summarizing Perrault and Allen’s formal definitions of 
belief and desire in the predicate calculus. We’ll represent “S believes the 
proposition P” as the two-place predicate B(S,P). Reasoning about belief is 
done with a number of axiom schemas inspired by Hintikka (1969b) (such as 
B(A,P) A5(A, Q) =>> B(A,P A Q ); see Perrault and Allen (1980) for details). 
Knowledge is defined as ‘true belief’; S knows that P will be represented as 
KNOW(S,P), defined as follows: 

KNOW(S,P)=PAB(S,P) 

In addition to knowing that , we need to define knowing whether. S 
knows whether (KNOWIF) a proposition P is true if S KNOWs that P or S 
KNOWS that -.P: 



Section 19.3. Automatic Interpretation of Dialogue Acts 


731 


KNOWIF(S, P ) = KNOW(5, P) V KNOW(5, -P) 

The theory of desire relies on the predicate WANT. If an agent S wants 
P to be true, we say WANT(S,P ), or W(S,P) for short. P can be a state 
or the execution of some action. Thus if ACT is the name of an action, 

W(S, ACT(ff)) means that S wants H to do ACT. The logic of WANT relies 
on its own set of axiom schemas just like the logic of belief. 

The BDI models also require an axiomatization of actions and plan¬ 
ning; the simplest of these is based on a set of action schemas similar to the schema 
AI planning model STRIPS (Fikes and Nilsson, 1971). Each action schema 
has a set of parameters with constraints about the type of each variable, and 
three parts: 

• Preconditions: Conditions that must already be true in order to suc¬ 
cessfully perform the action. 

• Effects: Conditions that become true as a result of successfully per¬ 
forming the action. 

• Body: A set of partially ordered goal states that must be achieved in 
performing the action. 

In the travel domain, for example, the action of agent A booking flight FI 
for client C might have the following simplified definition: 

BOOK-FLIGHT(A,C,F): 

Constraints: Agent(A) A Flight(F) A Client(C) 

Precondition: Know(A,departure-date(F)) A Know(A,departure- 

time(F)) A Know(A,origin-city(F)) A 

Rnow(A,destination-city(F)) A Know(A,flight-type(F)) A 
Has-Seats(F) A W(C,(BOOK(A,C,F))) A ... 


Effect: Flight-Booked(A,C,F) 

Body: Make-Reservation(A,F,C) 

Cohen and Perrault (1979) and Perrault and Allen (1980) use this kind 
of action specification for speech acts. For example here is Perrault and 
Allen’s definition for three speech acts relevant to indirect requests. IN¬ 
FORM is the speech act of informing the hearer of some proposition (Austin/Searle’s 
Assertive , or DAMSL’s STATEMENT). The definition of INFORM is based 
on Grice’s (1957) idea that a speaker informs the hearer of something merely 
by causing the hearer to believe that the speaker wants them to know some¬ 
thing: 
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INFORM(S,H,P): 

Constraints: Speaker(S) A Hearer(H) A Proposition(P) 

Precondition: Know(S,P) A W(S, INFORM(S, H, P)) 

Effect: Know(H,P) 

Body: B(H,W(S,Know(H,P))) 

INFORMIF is the act used to inform the hearer whether a proposition 
is true or not; like INFORM, the speaker INFORMIFs the hearer by causing 
the hearer to believe the speaker wants them to KNOWIF something: 

INFORMIF(S,H,P): 

Constraints: Speaker(S) A Hearer(H) A Proposition(P) 

Precondition: KnowIf(S, P) A W(S, INFORMIF(S, H, P)) 

Effect: KnowIf(H, P) 

Body: B(H, W(S, KnowIf(H, P))) 

REQUEST is the directive speech act for requesting the hearer to per¬ 
form some action: 

REQUESTS,H,ACT): 

Constraints: Speaker(S) A Hearer(H) A ACT(A) A H is agent of ACT 
Precondition: W(S,ACT(H)) 

Effect: W(H,ACT(H)) 

Body: B(H,W(S,ACT(H))) 

Perrault and Allen’s theory also requires what are called ‘surface-level 
acts’. These correspond to the ‘literal meanings’ of the imperative, interroga¬ 
tive, and declarative structures. For example the ’surface-level’ act S.REQUEST 
produces imperative utterances: 

S.REQUEST (S, H, ACT): 
effect: B(H, W(S,ACT(H))) 

The effects of S.REQUEST match the body of a regular REQUEST, 
since this is the default or standard way of doing a request (but not the only 
way). This ‘default’ or ‘literal’ meaning is the start of the hearer’s inference 
chain. The hearer will be given an input which indicates that the speaker is 
requesting the hearer to inform the speaker whether the hearer is capable of 
giving the speaker a list: 

S.REQUEST(S,H,InformIf(H,S,CanDo(H,Give(H,S,LIST)))) 

The hearer must figure out that the speaker is actually making a re¬ 
quest: 

REQUEST(H,S ,Give(H,S ,LIST)) 
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The inference chain from the request-to-inform-if-cando to the request- 
to-give is based on a chain of plausible inference , based on heuristics called 
plan inference (PI) rules. We will use the following subset of the rules that inference 
P errault and Allen (1980) propose: 

• (PI.AE) Action-Effect Rule: For all agents S and H, if Y is an effect 
of action X and if H believes that S wants X to be done, then it is 
plausible that H believes that S wants Y to obtain. 

• (PI.PA) Precondition-Action Rule: For all agents S and H, if X is a 
precondition of action Y and if H believes S wants X to obtain, then it 
is plausible that H believes that S wants Y to be done. 

• (PI.BA) Body-Action Rule: For all agents S and H, if X is part of the 
body of Y and if H believes that S wants X done, then it is plausible 
that H believes that S wants Y done. 

• (PI.KP) Know-Desire Rule: For all agents S and H, if H believes S 
wants to KNOWIF(P), then H believes S wants P to be true: 

B(H,W(S, KNOWIF(S,/ > ))) B(H,W(S,P)) 

• (EI.1) Extended Inference Rule: if B(H, W(S, X )) p “ le B{II,W{S, Y)) 

is a PI rule, then 

is a PI rule. (i.e. you can prefix B(H , W ( S )) to any plan inference rule). 

Let’s see how to use these rules to interpret the indirect speech act in 
Can you give me a list of flights from Atlanta?. Step (0) in the table below 
shows the speaker’s initial speech act, which the hearer initially interprets 
literally as a question. Step (1) then uses Plan Inference rule Action-Effect , 
which suggests that if the speaker asked for something (in this case infor¬ 
mation), they probably want it. Step (2) again uses the Action-Effect rule, 
here suggesting that if the Speaker want an 1NFORMIF, and KNOWIF is an 
effect of INFORMIF, then the speaker probably also wants KNOWIF. 

Rule Step Result 

(0) S. REQUEST (S ,H, InformIf(H, S, CanDo(H, Give(H, S,LIST)))) 

PI.AE (1) B(H,W(S,InformIf(H,S,CanDo(H,Give(H,S,LIST))))) 

PI.AE/EI (2) B(H,W(S,KnowIf(H,S,CanDo(H,Give(H,S,LIST))))) 

PI.KP/EI (3) B(H,W(S,CanDo(H,Give(H,S,LIST)))) 

PI.PA/EI (4) B(H,W(S,Give(H,S,LIST))) 

PI.BA (5) REQUEST (H, S, Give(H, S,LIST)) 
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Step (3) adds the crucial inference that people don’t usually ask about 
things they aren’t interested in; thus if the speaker asks whether something is 
true (in this case CanDo), the speaker probably wants it (CanDo) to be true. 
Step (4) makes use of the fact that CanDo(ACT) is a precondition for (ACT), 
making the inference that if the speaker wants a precondition (CanDo) for 
an action (Give), the speaker probably also wants the action (Give). Finally, 
step (5) relies on the definition of REQUEST to suggest that if the speaker 
wants someone to know that the speaker wants them to do something, then 
the speaker is probably REQUESTing them to do it. 

In giving this summary of the plan-inference approach to indirect speech 
act comprehension, we have left out many details, including many necessary 
axioms, as well as mechanisms for deciding which inference rule to apply. 
The interested reader should consult Perrault and Allen (1980) and the other 
literature suggested at the end of the chapter. 

Cue-based interpretation of Dialogue Acts 

The plan-inference approach to dialogue act comprehension is extremely 
powerful; by using rich knowledge structures and powerful planning tech¬ 
niques the algorithm is designed to address even subtle indirect uses of dia¬ 
logue acts. The disadvantage of the plan-inference approach is that it is very 
time-consuming both in terms of human labor in development of the plan- 
inference heuristics, and in terms of system time in running these heuristics. 
In fact, by allowing all possible kinds of non-linguistic reasoning to play a 
part in discourse processing, a complete application of this approach is AI- 
ai complete complete. An Al-complete problem is one which cannot be truly solved 
without solving the entire problem of creating a complete artificial intelli¬ 
gence. 

Thus for many applications, a less sophisticated but more efficient 
data-driven method may suffice. One such method is a variant of the id¬ 
iom method discussed above. Recall that in the idiom approach, sentences 
like Can you give me a list of flights from Atlanta? have two literal mean¬ 
ings; one as a question and one as a request. This can be implemented in the 
grammar by listing sentence structures like Can you X with two meanings. 
The cue-based approach to dialogue act comprehension we develop in this 
section is based on this idiom intuition. 

A number of researchers have used what might be called a cue-based 
approach to dialogue act interpretation, although not under that name. What 
characterizes a cue-based model is the use of different sources of knowledge 
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(cues) for detecting a dialogue act, such as lexical, collocational, syntac¬ 
tic, prosodic, or conversational-structure cues. The models we will describe 
use (supervised) machine-learning algorithms, trained on a corpus of dia¬ 
logues that is hand-labeled with dialogue acts for each utterance. Which 
cues are used depends on the individual system. Many systems rely on the 
fact that individual dialogue acts often have what Goodwin (1996) called a 
microgrammar; specific lexical, collocation, and prosodic features which 
are characteristic of them. These systems also rely on conversational struc¬ 
ture. The dialogue-act interpretation system of Jurafsky et al (1997), for 
example, relies on 3 sources of information: 

1. Words and Collocations: Please or would you is a good cue for a 
REQUEST, are you for YES-NO-QUESTIONs. 

2. Prosody: Rising pitch is a good cue for a YES-NO-QUESTION. Loud¬ 
ness or stress can help distinguish the yeah that is an AGREEMENT 
from the yeah that is a BACKCHANNEL 

3. Conversational Structure: A yeah which follows a proposal is prob¬ 
ably an AGREEMENT; a yeah which follows an INFORM is probably 
a BACKCHANNEL. 

The previous section focused on how the plan-based approach figured 
out that a surface question had the illocutionary force of a REQUEST. In this 
section we’ll look at a different kind of indirect request; the CHECK, exam¬ 
ining the specific cues that the Jurafsky et al (1997) system uses to solve 
this dialogue act identification problem. Recall that a CHECK is a subtype 
of question which requests the interlocutor to confirm some information; the 
information may have been mentioned explicitly in the preceding dialogue 
(as in the example below), or it may have been inferred from what the inter¬ 
locutor said: 

A OPEN-OPTION I was wanting to make some arrangements for 
a trip that I’m going to be taking uh to LA uh 
beginning of the week after next. 

B HOLD OK uh let me pull up your profile and I’ll be 

right with you here, [pause] 

B CHECK And you said you wanted to travel next week? 

A ACCEPT Uh yes. 


Examples of possible realizations of CHECKs in English include: 
1. As tag questions: 


MICROGRAM¬ 

MAR 
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(19.11) From the Trains corpus (Allen and Core, 1997) 

U and it’s gonna take us also an hour to load boxcars right? 

S right 

2. As declarative questions, usually with rising intonation (Quirk et al ., 

1985b, p. 814) 

(19.12) From the Switchboard corpus (Godfrey etal. , 1992) 

A and we have a powerful computer down at work. 

B Oh (laughter) 

B so, you don’t need a personal one (laughter)? 

A No 

3. As fragment questions (subsentential units; words, noun-phrases, clauses) 

(Weber, 1993) 

(19.13) From the Map Task corpus (Carletta et al. , 1997) 

G Ehm, curve round slightly to your right. 

F To my right? 

G Yes. 

Studies of checks have shown that, like the examples above, they are 
most often realized with declarative structure (i.e. no aux-inversion), they are 
most likely to have rising intonation (Shriberg et al , 1998), and they often 
have a following question tag, often right , (Quirk etal., 1985b, 810-814), as 
in (19.11) above. They also are often realized as ‘fragments’ (subsentential 
words or phrases) with rising intonation (Weber, 1993). In Switchboard, the 
REFORMULATION subtype of CHECKs have a very specific microgram¬ 
mar, with declarative word order, often you as subject (31% of the cases), 
often beginning with so (20%) or oh , and sometimes ending with then. Some 
examples: 

Oh so you ’re from the Midwest too. 

So you can steady it. 

You really rough it then. 

Many scholars, beginning with Nagata and Morimoto (1994), realized 
that much of the structure of these microgrammars could be simply captured 
by training a separate word-A-gram grammar for each dialogue act (see e.g. 
Suhm and Waibel, 1994; Mast et al. , 1996; Jurafsky et al. , 1997; Wamke 
et al. , 1997; Reithinger and Klesen, 1997; Taylor et al. , 1998). These sys¬ 
tems create a separate mini-corpus from all the utterances which realize the 
same dialogue act, and then train a separate word-A-gram language model 
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on each of these mini-corpora. Given an input utterance u consisting of a 
sequence of words W , they then choose the dialogue act d whose A-gram 
grammar assigns the highest likelihood to W : 

d* = argmaxP(d|W) = argmaxP(d)P(fk|d) (19.14) 

d d 

This simple A-gram approach does indeed capture much of the micro¬ 
grammar; for example examination of the high-frequency bigram pairs in 
Switchboard reformulations shows that the most common bigrams in¬ 
clude good cues for reformulations like so you , sounds like , so you're , 
oh so , you mean , so they , and so it's. 

Prosodic models of dialogue act microgrammar rely on phonological 
features like pitch or accent, or their acoustic correlates like FO, duration, and 
energy discussed in Chapter 4 and Chapter 7. For example many studies have 
shown that capturing the rise in pitch at the end of YES-NO-QUESTIONS 
can be a useful cue for augmenting lexical cues (Sag and Liberman, 1975; 
Pierrehumbert, 1980; Waibel, 1988; Daly and Zue, 1992; Kompe et al , 

1993; Taylor et al. , 1998). Pierrehumbert (1980) also showed that declar¬ 
ative utterances (like STATEMENTS) have final lowering: a drop in FO at lowering 
the end of the utterance. One system which relied on these results, Shriberg 
et al. (1998), trained CART-style decision trees on simple acoustically-based 
prosodic features such as the slope of FO at the end of the utterance, the av¬ 
erage energy at different places in the utterance, and various duration mea¬ 
sures. They found that these features were useful, for example, in distin¬ 
guishing the four dialogue acts statement (S), yes-no question (QY), 
DECLARATIVE-QUESTIONS like CHECKS (QD) and WH-QUESTIONS (QW). 

Figure 19.3 shows the decision tree which gives the posterior probability 
P(d\ f) of a dialogue act d type given sequence of acoustic features F. Each 
node in the tree shows four probabilities, one for each of the four dialogue 
acts in the order S, QY, QW, QD; the most likely of the four is shown as the 
label for the node. Via the Bayes rule, this probability can be used to com¬ 
pute the likelihood of the acoustic features given the dialogue act: P{f\d). 

A final important cue for dialogue act interpretation is conversational 
structure. One simple way to model conversational structure, drawing on 
the idea of adjacency pairs (Schegloff, 1968; Sacks et al ., 1974) introduced 
above, is as a probabilistic sequence of dialogue acts. The identity of the 
previous dialogue acts can then be used to help predict upcoming dialogue 
acts. Many studies have modeled dialogue act sequences as dialogue-act-A- 
grams (Nagata and Morimoto, 1994; Suhm and Waibel, 1994; Wamke et al ., 
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Figure 19.3 Decision tree for the classification of STATEMENT (S), YES-NO QUES¬ 
TIONS (QY), WH-QUESTIONS (QW) and DECLARATIVE QUESTIONS (QD), after 
Shriberg et al (1998). Note that the difference between S and QY toward the right of the tree 
is based on the feature norm_f 0 _di f f (normalized difference between mean FO of end and 
penultimate regions), while the difference between WQ and QD at the bottom left is based 
on utt _grad, which measures FO slope across the whole utterance. 


1997; Chu-Carroll, 1998; Stolcke et al , 1998; Taylor et al , 1998); often as 
part of an HMM system for dialogue acts (Reithinger etal , 1996; Kita etal , 
1996; Woszczyna and Waibel, 1994). For example Woszczyna and Waibel 
(1994) give the dialogue HMM shown in Figure 19.4 for a Verbmobil-like 
appointment scheduling task. 

How does the dialogue act interpreter combine these different cues to 
find the most likely correct sequence of correct dialogue acts given a con¬ 
versation? Stolcke et al (1998) and Taylor et al (1998) apply the HMM 
intuition of Woszczyna and Waibel (1994) to treat the dialogue act detection 
process as HMM-parsing. Given all available evidence E about a conversa¬ 
tion, the goal is to find the dialogue act sequence D = {d\ .dn... ,<7v} that 
has the highest posterior probability P(D\E ) given that evidence (here we 
are using capital letters to mean sequences of things). Applying Bayes’ Rule 
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we get 


D* = 


WLgmaxP(D\E) 

D 


P(D)P(E\D) 

= "T* m 

= argmax/^D)/ 5 ^/)) 
D 


(19.15) 


Here P(D) represents the prior probability of a sequence of dialogue acts D. 
This probability can be computed by the dialogue act A-grams introduced 
by Nagata and Morimoto (1994). The likelihood P(E\D ) can be computed 
from the other two sources of evidence: the microsyntax models (for ex¬ 
ample the different word-A-gram grammars for each dialogue act) and the 
microprosody models (for example the decision tree for the prosodic fea¬ 
tures of each dialogue act). The word-A-grams models for each dialogue act 
can be used to estimate P{W\D ), the probability of the sequence of words W. 
The microprosody models can be used to estimate P(F\D), the probability 
of the sequence of prosodic features F. 

If we make the simplifying (but of course incorrect) assumption that 
the prosody and the words are independent, we can estimate the evidence 
likelihood for a sequence of dialogue acts D as follows: 


P(E\D) = P(F\D)P(W\D) (19.16) 

We can compute the most likely sequence of dialogue acts D* by sub¬ 
stituting equation (19.16) into equation (19.15), thus choosing the dialogue 
act sequence which maximizes the product of the three knowledge sources 
(conversational structure, prosody, and lexical/syntactic knowledge): 
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D* = argmaxP(D)P(F\D)P(W\D) 

D 

Standard HMM-parsing techniques (like Yiterbi) can then be used to 
search for this most-probable sequence of dialogue acts given the sequence 
of input utterances. 

The HMM method is only one way of solving the problem of data- 
driven dialogue act identification. The link with HMM tagging suggests 
another approach, treating dialogue acts as tags, and applying other part- 
of-speech tagging methods. Samuel et al (1998b), for example, applied 
Transformation-Based Learning to dialogue act tagging. 

Summary 

As we have been suggesting, the two ways of doing dialogue act interpre¬ 
tation (via inference and via cues) each have advantages and disadvantages. 
The cue-based approach may be more appropriate for systems which require 
relatively shallow dialogue structure which can be trained on large corpora. 
If a semantic interpretation is required, the cue-based approach will still need 
to be augmented with a semantic interpretation. The full inferential approach 
may be more appropriate when more complex reasoning is required. 


19.4 Dialogue Structure and Coherence 

Section 18.2 described an approach to determining coherence based on a set 
of coherence relations. In order to determine that a coherence relation holds, 
the system must reason about the constraints that the relation imposes on 
the information in the utterances. We will call this view the informational 
approach to coherence. Historically, the informational approach has been 
applied predominantly to monologues. 

The BDI approach to utterance interpretation gives rise to another view 
of coherence, which we will call the intentional approach. According to 
this approach, utterances are understood as actions, requiring that the hearer 
infer the plan-based speaker intentions underlying them in establishing co¬ 
herence. In contrast to the informational approach, intentional approach has 
been applied predominantly to dialogue. 

The intentional approach we describe here is due to Grosz and Sidner 
(1986), who argue that a discourse can be represented as a composite of three 
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interacting components: a linguistic structure, an intentional structure, structure 
and an attentional state. The linguistic structure contains the utterances in s™ n cIure l 
the discourse, divided into a hierarchical structure of discourse segments. swe tional 
(Recall the description of discourse segments in Chapter 18.) The atten¬ 
tional state is a dynamically-changing model of the objects, properties, and 
relations that are salient at each point in the discourse. This aligns closely 
with the notion of a discourse model introduced in the previous chapter. Cen¬ 
tering (see Chapter 18) is considered to be a theory of attentional state in this 
approach. 

We will concentrate here on the third component of the approach, the 
intentional structure, which is based on the BDI model of interpretation de¬ 
scribed in the previous section. The fundamental idea is that a discourse 
has associated with it an underlying purpose that is held by the person who 
initiates it, called the discourse purpose (DP). Likewise, each discourse seg- pu s r c p 0 0 u s r e se 
ment within the discourse has a corresponding purpose, called a discourse 

DISCOURSE 

segment purpose (DSP). Each DSP has a role in achieving the DP of the dis- segment 
course in which its corresponding discourse segment appears. Listed below 
are some possible DPs/DSPs that Grosz and Sidner give. 

1. Intend that some agent intend to perform some physical task. 

2. Intend that some agent believe some fact. 

3. Intend that some agent believe that one fact supports another. 

4. Intend that some agent intend to identify an object (existing physical 
object, imaginary object, plan, event, event sequence). 

5. Intend that some agent know some property of an object. 


As opposed to the larger sets of coherence relations used in informa¬ 
tional accounts of coherence, Grosz and Sidner propose only two such re¬ 
lations: dominance and satisfaction-precedence. DSPi dominates DSP 2 if 
satisfying DSP 2 is intended to provide part of the satisfaction of DSPi. DSPi 
satisfaction-precedes DSP 2 if DSPi must be satisfied before DSP 2 . 

As an example, let’s consider the dialogue between a client (C) and a 
travel agent (A) that we saw earlier, repeated here in Figure 19.5. 

Collaboratively, the caller and agent successfully identify a flight that 
suits the caller’s needs. Achieving this joint goal required that a top-level 
discourse intention be satisfied, listed as II below, in addition to several in¬ 
termediate intentions that contributed to the satisfaction of II, listed as 12-15. 


II: (Intend C (Intend A (A find a flight for C))) 

12: (Intend A (Intend C (Tell C A departure date))) 
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Ci: I need to travel in May. 

Ai: And, what day in May did you want to travel? 

C 2 : OK uh I need to be there for a meeting that’s from the 12th to the 

15th. 

A 2 : And you’re flying into what city? 

C3: Seattle. 

A 3 : And what time would you like to leave Pittsburgh? 

C 4 : Uh hmm I don’t think there’s many options for non-stop. 

A4: Right. There’s three non-stops today. 

C 5 : What are they? 

A 5 : The first one departs PGH at 10:00am arrives Seattle at 12:05 their 
time. The second flight departs PGH at 5:55pm, arrives Seattle at 
8 pm. And the last flight departs PGH at 8:15pm arrives Seattle at 
10:28pm. 

C&: OK I’ll take the 5ish flight on the night before on the 11th. 

Ae: On the 11th? OK. Departing at 5:55pm arrives Seattle at 8 pm, US 

Air flight 115. 

C 7 : OK. _ 

Figure 19.5 A fragment from a telephone conversation between a client (C) 

and a travel agent (A) (repeated from Figure 19.1). 


13: (Intend A (Intend C (Tell C A destination city))) 

14: (Intend A (Intend C (Tell C A departure time))) 

15: (Intend C (Intend A (A find a nonstop flight for C))) 

Intentions 12-15 are all subordinate to intention II, as they were all adopted 
to meet preconditions for achieving intention II. This is reflected in the 
dominance relationships below. 

II dominates 12 
II dominates 13 
II dominates 14 

11 dominates 15 

Furthermore, intentions 12 and 13 needed to be satisfied before intention 15, 
since the agent needed to know the departure date and destination city in 
order to start listing nonstop flights. This is reflected in the satisfaction- 
precedence relationships below. 

12 satisfaction-precedes 15 
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13 satisfaction-precedes 15 

The dominance relations give rise to the discourse structure depicted 
in Figure 19.6. Each discourse segment is numbered in correspondence with 
the intention number that serves as its DP/DSP. 


DS 1 

Cl ds 2 ds 3 ds 4 ds 5 

A1-C2 A2-C3 A3 C4-C7 


Figure 19.6 Discourse Structure of the Flight Reservation Dialogue 


On what basis does this set of intentions and relationships between 
them give rise to a coherent discourse? It is their role in the overall plan 
that the caller is inferred to have. There are a variety of ways that plans can 
be represented; here we will use the simple STRIPS model described in the 
previous section. We make use of two simple action schemas; the first is the 
one for booking a flight, repeated from page 731. 

BOOK-FLIGHT(A,C,F): 

Constraints: Agent(A) A Flight(F) A Client(C) 

Precondition: Know(A,departure-date(F)) A Know(A,departure- 

time(F)) A Know(A,origin-city(F)) A 

Know(A,destination-city(F)) A Know(A,flight-type(F)) A 
Has-Seats(F) A W(C,(BOOK(A,C,F))) A ... 


Effect: Flight-Booked(A,C,F) 

B ody: Make-Reservation( A,F,C) 


As can be seen, booking a flight requires that the agent know a variety 
of parameters having to do with the flight, including the departure date and 
time, origin and destination cities, and so forth. The utterance with which 
the caller initiates the example dialogue contains the origin city and partial 
information about the departure date. The agent has to request the rest; the 
second action schema we use represents a simplified view of this action (see 
Cohen and Perrault (1979) for a more in-depth discussion of planning wh- 
questions): 
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REQUEST-INFO(A,C,I): 

Constraints: Agent(A) A Client(C) 
Precondition: Know(C,I) 

Effect: Know(A,I) 

Body: B(C,W(A,Know(A,I))) 


SUBDIA¬ 

LOGUES 


INFORMATION¬ 

SHARING 

SUBDIA¬ 

LOGUES 


CORRECTION 

SUBDIA¬ 

LOGUE 


Because the effects of REQUEST-INFO match each precondition of 
BOOK-FLIGHT, the former can be used to serve the needs of the latter. Dis¬ 
course segments DS2 and DS3 are cases in which performing REQUEST- 
INFO succeeds for identifying the values of the departure date and desti¬ 
nation city parameters respectively. Segment DS4 is also a request for a 
parameter value (departure time), but is unsuccessful in that the caller takes 
the initiative instead, by (implicitly) asking about nonstop flights. Segment 
DS5 leads to the satisfaction of the top-level DP from the caller’s selection 
of a nonstop flight from a short list that the agent produced. 

Subsidiary discourse segments like DS2 and DS3 are also called sub¬ 
dialogues. The type of subdialogues that DS2 and DS3 instantiate are gener¬ 
ally called knowledge precondition subdialogues (Lochbaum et al , 1990; 
Lochbaum, 1998), since they are initiated by the agent to help satisfy pre¬ 
conditions of a higher-level goal (in this case addressing the client’s request 
for travel in May). They are also called information-sharing sub dialogues 
(Chu-Carroll and Carberry, 1998). 

Later on in a part of the conversation not given in Figure 19.5 is another 
kind of subdialogue, a correction sub dialogue (Litman, 1985; Litman and 
Allen, 1987). Utterances C 20 through C 23 ^ constitute a correction to the 
previous plan of returning on May 15: 


Ai 7 : And you said returning on May 15th? 

Cis: Uh, yeah, at the end of the day. 

A 19 : OK. There’s #two non-stops ... # 

C 20 : #Act... actually#, what day of the week is the 15th? 

A 21 : It’s a Friday. 

C 22 : Uh hmm. I would consider staying there an extra day til Sunday. 
A 23 ^: OK... OK. 

A 23 b- On Sunday I have ... 


CORRECTION 

SUBDIA- 

kfflffoTION 

SUBDIA¬ 

LOGUES 


Other kinds of subdialogues that have been addressed in the literature 
include subtask subdialogues (Grosz, 1974), which are used to deal with 
subtasks of the overall task in a task-oriented dialogue, and correction sub¬ 
dialogues (or negotiation sub dialogues) which are used to deal with con- 
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flicts or collaborative negotiation between the participants (Chu-Carroll and 
Carberry, 1998). 

Determining Intentional Structure Algorithms for inferring intentional 
structure in dialogue (and spoken monologue) work similarly to algorithms 
for inferring dialogue acts. Many algorithms apply variants of the BDI 
model (e.g. Litman, 1985; Grosz and Sidner, 1986; Litman and Allen, 1987; 

Carberry, 1990; Passonneau and Litman, 1993; Chu-Carroll and Carberry, 

1998). Others rely on similar cues to those described for utterance- and 
turn-segmentation on page 720, including cue words and phrases (Reich- 
man, 1985; Grosz and Sidner, 1986; Hirschberg and Litman, 1993), prosody 
(Grosz and Hirschberg, 1992; Hirschberg and Pierrehumbert, 1986; Hirschberg 
andNakatani, 1996), and other cues. For example Pierrehumbert and Hirschberg 
(1990) argue that certain boundary tones might be used to suggest a domi- tones dary 
nance relation between two intonational phrases. 

Informational versus Intentional Coherence As we just saw, the key to 

intentional coherence lies in the ability of the dialogue participants to rec¬ 
ognize each other’s intentions and how they fit into the plans they have. On 
the other hand, as we saw in the previous chapter, informational coherence 
lies in the ability to establish certain kinds of content-bearing relationships 
between utterances. So one might ask what the relationship between these 
are: does one obviate the need for the other, or do we need both? 

Moore and Pollack (1992), among others, have argued that in fact both 
levels of analysis must co-exist. Let us assume that after our agent and caller 
have identified a flight, the agent makes the statement in passage (19.17). 

(19.17) You’ll want to book your reservations before the end of the day. 

Proposition 143 goes into effect tomorrow. 

This passage can be analyzed either from the intentional or informational 
perspective. Intentionally, the agent intends to convince the caller to book 
her reservation before the end of the day. One way to accomplish this is to 
provide motivation for this action, which is the role served by uttering the 
second sentence. Informationally, the two sentences satisfy the Explanation 
relation described in the last chapter, since the second sentence provides a 
cause for the effect of wanting to book the reservations before the end of the 
day. 

Depending on the knowledge of the caller, recognition at the informa¬ 
tional level might lead to recognition of the speaker’s plan, or vice versa. 

Say, for instance, that the caller knows that Proposition 143 imposes a new 
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tax on airline tickets, but did not know the intentions of the agent in uttering 
the second sentence. From the knowledge that a way to motivate an action is 
to provide a cause that has that action as an effect, the caller can surmise that 
the agent is trying to motivate the action described in the first sentence. Al¬ 
ternatively, the caller might have surmised this intention from the discourse 
scenario, but have no idea what Proposition 143 is about. Again, knowing the 
relationship between establishing a cause-effect relationship and motivating 
something, the caller might be led to assume an Explanation relationship, 
which would require that she infers that the proposition is somehow bad for 
airline ticket buyers (e.g., a tax). Thus, at least in some cases, both levels of 
analysis appear to be required. 


19.5 Dialogue Managers in Conversational Agents 


SINGLE 

INITIATIVE 

SYSTEM 

INITIATIVE 


The idea of a conversational agent is a captivating one, and conversational 
agents like ELIZA, PARRY, or SHRDLU have become some of the best- 
known examples of natural language technology. Modem examples of con¬ 
versational agents include airline travel information systems, speech-based 
restaurant guides, and telephone interfaces to email or calendars. The dia¬ 
logue manager is the component of such conversational agents that controls 
the flow of the dialogue, deciding at a high level how the agents side of the 
conversation should proceed, what questions to ask or statements to make, 
and when to ask or make them. 

This section briefly summarizes some issues in dialogue manager de¬ 
sign, discussing some simple systems based on finite-state automata and pro¬ 
duction mles, and some more complex ones based on more sophisticated 
BDI-style reasoning and planning techniques. 

The simplest dialogue managers are based on finite-state automata. For 
example, imagine a trivial airline travel system whose job was to ask the user 
for a departure city, a destination city, a time, and any airline preference. Fig¬ 
ure 19.7 shows a sample dialogue manager for such a system. The states of 
the FSA correspond to questions that the dialogue manager asks the user, and 
the arcs correspond to actions to take depending on what the user responds. 

Systems which completely control the conversation in this way are 
called single initiative or system initiative systems. While this simple di¬ 
alogue manager architecture is sufficient for some tasks (for example for 
implementing a speech interface to an automatic teller machine or a simple 
geography quiz), it is probably too restricted for a speech based travel agent 
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not-city (answer) 


not-city(answer) 


^ ^What cit y are you leavin g from?^^ »^"Piease say the name of a city^ 

is-city (answer) not-city(answer) 



"Please say the name 


— 

of a city''j> J 


is-city(answer) 
not-time(answer) 


not-time(answer) 


say morning or evening 


■^Please 


Tfme(answer) 
not-yes-or-no(answer) 


£> 


n ot-yes-or- n o(a n swer) 


^jDoyou want to specify a carrier?" 


is-yes(answer) 


^^Which carrier do you prefer 



is-no(answer] 


lease say ‘yes^or^ 
is-yes(answer) 


no 


O 


is-no(answer) 

• • • 


Figure 19.7 A simple finite-state automaton architecture for a dialogue 
manager. 


system (see the discussion in McTear (1998)). One reason is that it is con¬ 
venient for users to use more complex sentences that may answer more than 
one question at a time, as in the following ATIS example: 

I want a flight from Milwaukee to Orlando one way leaving after 
five pm on Wednesday. 

Many speech-based question answering systems, beginning with the 
influential GUS system for airline travel planning (Bobrow et al , 1977), and 
including more recent ATIS systems and other travel and restaurant guides, 
are frame- or template-based. For example, a simple airline system might 
have the goal of helping a user find an appropriate flight. It might have a 
frame or template with slots for various kinds of information the user might 
need to specify. Some of the slots come with prespecified questions to ask 
the user: 

Slot Optional Question 

From_Airport “From what city are you leaving?” 

To_Airport “Where are you going?” 

Dep_time “When would you like to leave?” 

Arr_time “When do you want to arrive?” 

Fare_class 

Airline 

Oneway 


FRAME 

TEMPLATE 
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Such a simple dialogue manager may just ask questions of the user, 
filling out the template with the answers, until it has enough information to 
perform a data base query, and then return the result to the user. Not every 
slot may have a associated question, since the dialogue designer may not 
want the user deluged with questions. Nonetheless, the system must be able 
to fill these slots if the user happens to specify them. 

Even such simple domains require more than this single-template ar¬ 
chitecture. For example, there is likely to be more than one flight which 
meet the user’s constraints. This means that the user will be given a list of 
choices, either on a screen or, for a purely telephone interface, by listing 
them verbally. A template-based system can then have another kind of tem¬ 
plate which has slots for identifying elements of lists of flights {How much 
is the first one? or Is the second one non-stop?). Other templates might have 
general route information (for questions like Which airlines fly from Boston 
to San Francisco ?), information about airfare practices (for questions like 
Do I have to stay a specific number of days to get a decent airfare?) or about 
car or hotel reservations. Since users may switch from template to template, 
and since they may answer a future question instead of the one the system 
asked, the system must be able to disambiguate which slot of which tem¬ 
plate a given input is supposed to fill, and then switch dialogue control to 
that template. A template-based system is thus essentially a production rule 
system. Different types of inputs cause different productions to fire, each of 
which can flexibly fill in different templates. The production rules can then 
switch control based on factors such as the the user’s input and some simple 
dialogue history like the last question that the system asked. 

The template or production-rule dialogue manager architecture is often 
used when the set of possible actions the user could want to take is relatively 
limited, but where the user might want to switch around a bit among these 
things. 

The limitations of both the template-based and FSA-based dialogue 
managers are obvious. Consider the client’s utterance C 4 in the fragment of 
sample dialogue of Figure 19.5 on page 742, repeated here: 

A 3 : And what time would you like to leave Pittsburgh? 

C 4 : Uh hmm I don’t think there’s many options for non-stop. 

A 4 : Right. There’s three non-stops today. 

C 5 : What are they? 

A 5 : The first one departs PGH at 10:00am ... 

What the client is doing in C 4 is taking control or initiative of the 


INITIATIVE 
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dialogue. C 4 is an indirect request, asking the agent to check on non-stop 
flights. It would not be appropriate for the system to just set the wants 
non-stop field in a template and ask the user again for the departure time. 
The system needs to realize that the user has indicated that a non-stop flight 
is a priority and that the system should focus on that next. 

Conversational agents also need to use the grounding acts described 
on page 721. For example, when the user makes a choice of flights, it’s 
important for the agent to indicate to the client that it has understood this 
choice. Repeated below is an example of such grounding excerpted from 
our sample conversation: 

Ce: OK I’ll take the 5ish flight on the night before on the 11th. 

A 6 : On the 11th? OK. 

It is also important for a computational conversational agent to use 
requests for repairs, since given the potential for errors in the speech recog¬ 
nition or the understanding, there will often be times when the agent is con¬ 
fused or does not understand the user’s request. 

In order to address these and other problems, more sophisticated dia¬ 
logue managers can be built on the BDI (belief, desire, intention) architec¬ 
ture described on page 730. Such systems are often integrated with logic- 
based planning models, and treat a conversation as a sequence of actions to 
planned. 

Let’s consider the dialogue manager of the TRAINS-93 system; the 
system is described in Allen et al (1995), the dialogue manager in Traum 
and Allen (1994). The TRAINS system is a spoken-language conversational 
planning agent whose task is to assist the user in managing a railway trans¬ 
portation system in a microworld. For example, the user and the system 
might collaborate in planning to move a boxcar of oranges from one city to 
another. The TRAINS dialogue manager maintains the flow of conversation 
and addresses the conversational goals (such as coming up with a operational 
plan for achieving the domain goal of successfully moving oranges). To do 
this, the manager must model the state of the dialogue, its own intentions, 
and the user’s requests, goals, and beliefs. The manager uses a conversation 
act interpreter to semantically analyze the user’s utterances, a domain plan¬ 
ner and executer to solve the actual transportation domain problems, and a 
generator to generate sentences to the user. Figure 19.8 shows an outline of 
the TRAINS-93 dialogue manager algorithm. 

The algorithm keeps a queue of conversation acts it needs to generate. 
Acts are added to the queue based on grounding, dialogue obligations, or 
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Dialogue_Manager 

while conversation is not finished 
if user has completed a turn 
then interpret user’s utterance 
if system has obligations 
then address obligations 
else if system has turn 

then if system has intended conversation acts 
then call generator to produce NL utterances 
else if some material is ungrounded 
then address grounding situation 
else if high-level goals are unsatisfied 
then address goals 

else release turn or attempt to end conversation 
else if no one has turn 
then take turn 
else if long pause 
then take turn 


Figure 19.8 A dialogue manager algorithm, slightly modified from Traum 
and Allen ( 1994 ). 


the agent’s goals. Let’s examine each of these sources. Grounding acts were 
discussed on page 720; recall that a previous utterance can be grounded by an 
explicit backchannel (e.g. uh-huh, yeah, or under certain circumstances ok), 
or by repeating back part of the utterance. Utterances can also be grounded 
implicitly by ‘taking up’ the utterance, i.e. continuing in a way which makes 
it clear that the utterance was understood, such as by answering a question. 

Obligations are used in the TRAINS system to enable the system to 
correctly produce the second-pair part of an adjacency pair. That is, when a 
user REQUESTS something of the system (e.g. REQUEST(Give(List)), or 
REQUEST(InformIf(NonStop(FLIGHT-201)))), the REQUEST sets up an 
obligation for the system to address the REQUEST either by accepting it, 
and then performing it (giving the list or informing whether flight 201 is 
non-stop), or by rejecting it. 

Finally, the TRAINS dialogue manager must reason about its own 
goals. For the travel agent domain, the dialogue manager’s goal might be 
to find out the client’s travel goal and then create an appropriate plan. Let’s 
pretend that the human travel agent for the conversation in Figure 19.5 was 
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Methodology Box: Designing Dialogue Systems 

How does a dialogue system developer choose dialogue strategies, 
architectures, prompts, error messages, and so on? The three design 
principles of Gould and Lewis (1985) can be summarized as 

Key Concept #8. User-Centered Design: Study the user 
and task, build simulations and prototypes, and iteratively test 
them on the user and fix the problems. 

1. Early Focus on Users and Task: Understand the potential 
users and the nature of the task, via interviews with users and in¬ 
vestigation of similar systems. Study of related human-human dia¬ 
logues can also be useful, although the language in human-machine 
dialogues is usually simpler than in human-human dialogues (for 
example pronouns are rare in human-machine dialogue and are very 
locally bound when they do occur - Guindon, 1988). 

2. Build Prototypes: In the children's book The Wizard of 
Oz (Baum 1900), the Wizard turned out to he just a simulation 
controlled by a man behind a curtain. In Wizard-of-Oz (WOZ) or 
PNAMBIC (Pay No Attention to the Man Behind the Curtain) sys¬ 
tems, the users interact with what they think is a software system, 
but is in fact a human operator (‘wizard') behind some disguising 
interface software (e.g. Gould et al ., 1983: Good et al., 1984; Fraser 
and Gilbert, 1991) indexGood, M. D.. A WOZ system can be used 
to test out an architecture without implementing the complete sys¬ 
tem; only the interface software and databases need to be in place, it 
is difficult for the wizard to exactly simulate the errors, limitations, 
or time constraints of a real system; results of WOZ studies are thus 
somewhat idealized. 

3. Iterative Design: An iterative design cycle with embedded 
user testing is essential in system design (Nielsen, 1992; Cole et al., 
1994, 1997; Yankelovich et al., 1995; Landauer, 1995). For example 
Sfifclman etal. (1993) and Yankelovich etal. (1995) found that users 
of speech systems consistently tried to interrupt the system (barge 
in), suggesting a redesign of l he system to recognized overlapped 
speech. Kamm (1994) and Cole et al (1993) found that directive 
prompts (‘Say yes if you accept the call, otherwise, say no') or the 
use of constrained forms (Oviatt et al., 1993) produced better results 
than open-ended prompts like ‘Will you accept the call?’. 
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a system and explore what the state of a TRAINS-style dialogue manager 
would have to be to act appropriately. Let’s start with the state of the dia¬ 
logue manager (formatted following Traum and Allen (1994)) after the first 
utterances in our sample conversation (repeated here): 

Ci: I want to go to Pittsburgh in May. 

The client/user has just finished a turn with an inform speech act. 
The system has the discourse goal of finding out the user’s travel goal (e.g. 
‘Wanting to go to Pittsburgh on may 15 and returning... ’), and creating 
a travel plan to accomplish that goal. The following table shows the five 
parameters of the system state: the list of obligations, the list of intended 
speech acts to be passed to the generator, the list of the user’s speech acts 
that still need to be acknowledged, the list of discourse goals, and whether 
the system or the user holds the turn: 

Discourse obligations: NONE 

Turn holder: system 

Intended speech acts: N ONE 

Unacknowledged speech acts: INFORM-1 

Discourse goals: get-travel-goal, create-travel-plan 

After the utterance, the dialogue manager decides to add two conver¬ 
sation acts to the queue; first, to acknowledge the user’s inform act (via 
‘address grounding situation’), and second, to ask the next question of the 
user (via ‘address goals’). This reasoning would be worked out by the sys¬ 
tem’s STRIPS-style planner as described on page 743; given the goal get- 
travel-goal, the REQUEST-INFO action schema tells the system that asking 
the user something is one way of finding it out. The result of adding these 
two conversation acts is 

Intended speech acts: REQUEST-INFORM-1, ACKNOWLEDGE-1 

These would be combined by a very clever generator into the single 
utterance: 

A 2 : And, what day in May did you want to travel? 

Note that the grounding function was achieved both by beginning with the 
discourse marker and and by repeating back the month name May. The 
request for information is achieved via the wh-question. 

Let’s skip ahead to the client’s utterance C 4 . Recall that C 4 is an indi¬ 
rect request, asking the agent to check on non-stop flights. 

A 3 : And what time would you like to leave Pittsburgh? 
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C 4 : Uh hmm I don’t think there’s many options for non-stop. 

Let’s assume that our dialogue act interpreter correctly interprets C 4 as 
request-inform-3. The state of the agent after client utterance C 4 is then: 

Discourse obligations: address(REQUEST-INFORM-3) 

Turn holder: system 

Intended speech acts: NONE 

Unacknowledged speech acts: REQUEST-INFORM-3 
Discourse goals: get-travel-goal, create-travel-plan 

The dialogue manager will first address the discourse obligation of re¬ 
sponding to the user’s request by calling the planner to find out how many 
non-stop flights there are. The system must now answer the question, but 
must also ground the user’s utterance. For a direct request, the response is 
sufficient grounding. For an indirect request, an explicit acknowledgement 
is an option; since the indirect request was in the form of a negative check 
question, the form of acknowledgement will be right (no would have also 
been appropriate for acknowledging a negative. These two acts will then be 
pulled off the queue and passed to the generator: 

A 4 : Right. There’s three non-stops today. 

Dialogue managers also will need to deal with the kind of dialogue 
structure discussed in Section 19.4, both to recognize when the user has 
started a subdialogue, and to know when to initiate a subdialogue itself. 

19.6 SUMMARY 

Dialogue is a special kind of discourse which is particularly relevant to 
speech processing tasks like conversational agents and automatic meet¬ 
ing summarization. 

• Dialogue differs from other discourse genres in exhibiting turn-taking, 
grounding, and implicature. 

• An important component of dialogue modeling is the interpretation of 
dialogue acts. We introduced plan-based and cue-based algorithms 
for this. 

• Dialogue exhibits intentional structure in addition to the informa¬ 
tional structure, including such relations as dominance and satisfaction- 
precedence. 
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Methodology Box: Evaluating Dialogue Systems 

Many of the metrics that have been proposed for evaluating di¬ 
alogue systems can be grouped into the following three classes: 

1. User Satisfaction: Usually measured by interviewing users 
(Stifelman et al ., 1993; Yankclovich et al., 1995) or having them 
fill out questionnaires asking e.g, (Shriberg el al, 1992; Polifroni 
et al. , 1992): 

• Were answers provided quickly enough? 

• Did the system understand your requests the first Lime? 

• Do you think a person unfamiliar with computers could use the 
system easily? 

2. Task Completion Cost: 

• completion time in turns or seconds (Polifroni et al. 1992). 

• number of queries (Polifroni et al., 1992). 

• number of system non-responses (Polifroni et al. , 1992) or 
‘turn correction ratio’: the number of system or user turns that 
were used solely to correct errors, divided by the total num¬ 
ber of turns (Danicli and Gerbino, 1995; Hirschman and Pao, 
1993). 

• inappropriate ness (verbose or ambiguous) of system’s ques¬ 
tions, answers, and error messages (Zuc et al, 1989). 

3. Task Completion Success: 

• percent of subtasks that were completed (Polifroni et al, 
1992). 

• correctness (or partial correctness) of each question, answer, 
error message (Zuc et al, 1989; Polifroni et al, 1992), 

• correctness of the total solution (Polifroni et al, 1992). 

How should these metrics be combined and weighted? The 
PARADISE algorithm (Walker et al, 1997) (PARAdigm for Dia¬ 
logue System Evaluation) applies multiple regression to this prob¬ 
lem. The algorithm first uses questionnaires to assign each dialogue 
a user satisfaction rating. A set of cost and success factors like those 
above is then treated as a set of independent factors; multiple regres¬ 
sion is used to train a weight (coefficient) for each factor, measuring 
its importance in accounting for user satisfaction. The resulting met¬ 
ric can be used to compare quite different dialogue strategics. 
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• Dialogue managers for conversational agents range from simple template- 
or frame-based production systems to complete BDI (belief-desire- 
intention) models. 


Bibliographical and Historical Notes 

Early work on speech and language processing had very little emphasis on 
the study of dialogue. One of the earliest conversational systems, ELIZA, 
had only a trivial production system dialogue manager; if the human user’s 
previous sentence matched the regular-expression precondition of a possible 
response, ELIZA simply generated that response (Weizenbaum, 1966). The 
dialogue manager for the simulation of the paranoid agent PARRY (Colby 
et al , 1971), was a little more complex. Like ELIZA, it was based on a pro¬ 
duction system, but where ELIZA’s rules were based only on the words in the 
user’s previous sentence, PARRY’s rules also rely on global variables indi¬ 
cating its emotional state. Furthermore, PARRY’s output sometimes makes 
use of script-like sequences of statements when the conversation turns to its 
delusions. For example, if PARRY’s anger variable is high, he will choose 
from a set of ‘hostile’ outputs. If the input mentions his delusion topic, he 
will increase the value of his fear variable and then begin to express the 
sequence of statements related to his delusion. 

The appearance of more sophisticated dialogue managers awaited the 
better understanding of human-human dialogue. Studies of the properties 
of human-human dialogue began to accumulate in the 1970’s and 1980’s. 
The Conversation Analysis community (Sacks et al , 1974; Jefferson, 1984; 
Schegloff, 1982) began to study the interactional properties of conversation. 
Grosz’s (1977c) dissertation significantly influenced the computational study 
of dialogue with its introduction of the study of substructures in dialogues 
(subdialogues), and in particular with the finding that “task-oriented dia¬ 
logues have a structure that closely parallels the structure of the task being 
performed.” (p. 27). The BDI model integrating earlier Al planning work 
(Fikes and Nilsson, 1971) with speech act theory (Austin, 1962; Gordon and 
Lakoff, 1971; Searle, 1975a) was first worked out by Cohen and Perrault 
(1979), showing how speech acts could be generated, and Perrault and Allen 
(1980) and Allen and Perrault (1980), applying the approach to speech-act 
interpretation. 

The cue-based model of dialogue act interpretation was inspired by 
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Hinkelman and Allen (1989), who showed how lexical and phrasal cues 
could be integrated into the BDI model, and by the work on microgram¬ 
mar in the Conversation Analysis literature (e.g. Goodwin, 1996). It was 
worked out at a number of mainly speech recognition labs around the world 
in the late 1990’s (e.g. Nagata and Morimoto, 1994; Suhm and Waibel, 1994; 
Mast et al , 1996; Jurafsky et al , 1997; Wamke et al , 1997; Reithinger and 
Klesen, 1997; Taylor etal , 1998). 

Models of dialogue as collaborative behavior were introduced in the 
late 1980’s and 1990’s, including the ideas of reference as a collaborative 
process (Clark and Wilkes-Gibbs, 1986), and models of joint intentions 
(Levesque et al , 1990), and shared plans (Grosz and Sidner, 1980)). Re- 
initiative lated to this area is the study of initiative in dialogue, studying how the 
dialogue control shifts between participants Walker and Whittaker (1990), 
Smith and Gordon (1997). 


Exercises 

19.1 List the dialogue act misinterpretations in the Who's On First routine 
at the beginning of the chapter. 

19.2 Write a finite-state automaton for a dialogue manager for checking 
your bank balance and withdrawing money at an automated teller machine. 

19.3 Dispreferred responses (for example turning down a request) are usu¬ 
ally signaled by surface cues, such as significant silence. Try to notice the 
next time you or someone else utters a dispreferred response, and write down 
the utterance. What are some other cues in the response that a system might 
use to detect a dispreferred response? Consider non-verbal cues like eye- 
gaze and body gestures. 

19.4 When asked a question to which they aren’t sure they know the an¬ 
swer, people use a number of cues in their response. Some of these cues 
overlap with other dispreferred responses. Try to notice some unsure an¬ 
swers to questions. What are some of the cues? If you have trouble doing 
this, you may instead read Smith and Clark (1993) which lists some such 
cues, and try instead to listen specifically for the use of these cues. 

19.5 The sentence Do you have the ability to pass the salt? is not generally 
interpretable as a question. Why is this a problem for the BDI model? 
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19.6 Most universities require Wizard-of-Oz studies to be approved by a 
human subjects board, since they involve deceiving the subjects. It is a good 
idea (indeed it is often required) to ‘debrief’ the subjects afterwards and tell 
them the actual details of the task. Discuss your opinions of the moral issues 
involved in the kind of deceptions of experimental subjects that take place in 
Wizard-of-Oz studies. 

19.7 Implement a small air-travel help system. Your system should get 
constraints from the user about a particular flight that they want to take, 
expressed in natural language, and display possible flights on a screen. Make 
simplifying assumptions. You may build in a simple flight database or you 
may use an flight information system on the web as your backend. 

19.8 Augment your previous system to work over the phone (or alterna¬ 
tively, describe the user interface changes you would have to make for it to 
work over the phone). What were the major differences? 

19.9 Design a simple dialog system for checking your email over the tele¬ 
phone. Assume that you had a synthesizer which would read out any text 
you gave it, and a speech recognizer which transcribed with perfect accu¬ 
racy. If you have a speech recognizer or synthesizer, you may actually use 
them instead. 

19.10 Test your email-reading system on some potential users. If you don’t 
have an actual speech recognizer or synthesizer, simulate them by acting as 
the recognizer/synthesizer yourself. Choose some of the metrics described 
in the Methodology Box on page 754 and measure the performance of your 
system. 




GENERATION * 


20 


hello, world 

Kemighan & Ritchie, The C Programming Language 

... you, MR KEITH VLINDEN, will be a millionaire 
January 31! 

From a junk mailing 


In one sense, language generation is the oldest subfield of language 
processing. When computers were able to understand only the most unnat¬ 
ural of command languages, they were spitting out natural texts. For exam¬ 
ple, the oldest and most famous C program, the “hello, world” program, is 
a generation program. It produces useful, literate English in context. Unfor¬ 
tunately, whatever subtle or sublime communicative force this text holds is 
produced not by the program itself but by the author of that program. This 
approach to generation, called canned text, is easy to implement, but is un- canned text 
able to adapt to new situations without the intervention of a programmer. 

Language generation is also the most pervasive subfield of language 
processing. Who of us has not received a form letter with our name carefully 
inserted in just the right places, along with eloquent appeals for one thing or 
another. This sort of program is easy to implement as well, but I doubt 
if many are fooled into thinking that such a letter is hand-written English. 

The inflexibility of the mechanism is readily apparent when our names are 
mangled, as mine is in the junk mailing shown above, or when other obvious 


*This chapter by Keith Vander Linden 
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mistakes are made. 1 This approach, called template filling, is more flexible 
than canned text and has been used in a variety of applications, but is still 
limited. For example, Weizenbauirfs use of templates in ELIZA worked 
well in some situations, but produced nonsense in others. 2 

The success of simple generation mechanisms indicates that, to a first 
approximation, language generation is easier than language understanding. 
A language understanding system cannot generally control the complexity 
of the language structures it receives as input, while a generation system can 
limit the complexity of the structure of its output. Because of this, work in 
language processing initially focussed on language understanding, assuming 
that any generation that needed to be done could easily be handled with 
canned text or template filling mechanisms. Unfortunately, these simple 
mechanisms are not flexible enough to handle applications with any real¬ 
istic variation in the information being expressed and in the context of its 
expression. Even the generation used in the limited domain of the “hello, 
world” program could use more flexibility. It might be more appropriate for 
the program to produce: 

(20.1) Congratulations, you’ve just compiled and run a simple C program 
which means that your environment is configured properly. 

This text is more complex than the original and we can see a number of po¬ 
tential variations. If the readers are experienced systems engineers, then we 
might choose not to congratulate them on compiling a program. Doing so 
might insult them. In addition, there are a number of other ways of referring 
to the program in question, including “the program”, “your first C program”, 
“it” and “that lame excuse for code”, each of which might or might not be 
appropriate in a given situation. Note also that we didn’t need to conjoin 
“compiled and run” and we could have expressed the conclusion (“your en¬ 
vironment is configured properly”) and then the evidence (“you’ve compiled 
and run a simple C program”). These variations are not necessarily synony¬ 
mous. Language has evolved a variety of ways of expressing things, all of 

1 A recent example of this arose when parents of a college-bound student received an accep¬ 
tance letter that read “Congratulations on 987-65-4321’s admission ... as a parent you will be 
a partner with the university in encouraging 987-65-4321 to succeed.” (from the Associated 
Press, March 26, 1999). 

2 It’s not difficult to “trick” ELIZA into producing nonsense, as shown in the following dia¬ 
log with ELIZA 4.3: HUMAN : “Let’s talk just about you — not me. Can you think?” ELIZA: 
“You like to think I — not you — don’t you” ... HUMAN : “I think you are an extremely stupid 
program.” ELIZA: “Perhaps you would like to be an extremely stupid program” (Copeland, 
1993, page 40). 
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which are useful for expressing certain things in certain contexts. Therefore, 
to be useful in more complex environments, a generation system must be 
capable of: (1) producing an appropriate range of forms, and (2) choosing 
among those forms based on the intended meaning and the context. In this 
chapter, we will study the basic language generation techniques used to solve 
these problems, ignoring canned text and template-based mechanisms. 

20.1 Introduction to Language Generation 

Language understanding is somewhat like counting from one to 
infinity; language generation is like counting from infinity to one . 
Yorick Wilks, quoted in (Dale et al , 1998a, page 352) 

Generation from what?! 

attributed to Christopher Longuet-Higgins 


Natural Language Generation (NLG) is the process of constructing wnguage 
natural language outputs from non-linguistic inputs. The goal of this process NATURAL 
can be viewed as the inverse of that of natural language understanding under age 
(NLU) in that NLG maps from meaning to text, while NLU maps from text STANDING 
to meaning. In doing this mapping, generation visits many of the same lin¬ 
guistic issues discussed in the previous chapters, but the inverse orientation 
distinguishes its methods from those of NLU in two important ways. 

First, the nature of the input to the generation process varies widely 
from one application to the next. Although the linguistic input to NLU sys¬ 
tems may vary from one text type to another, all text is governed by relatively 
common grammatical rules. This is not the case for the input to generation 
systems. Each generation system addresses a different application with a dif¬ 
ferent input specification. One system may be explaining a complex set of 
numeric tables while another may be documenting the structure of an object- 
oriented software engineering model. As a result, generation systems must 
extract the information necessary to drive the generation process. 

Second, while both NLU and NLG must be able to represent a range 
of lexical and grammatical forms required for the application domain, their 
use of these representations is different. NLU has been characterized as a 
process of hypothesis management in which the linguistic input is sequen¬ 
tially scanned as the system considers alternative interpretations. Its domi- 
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nant concerns include ambiguity, under-specification, and ill-formed input. 
These concerns are not generally addressed in generation research because 
they don’t arise. The non-linguistic representations input to an NLG sys¬ 
tem tend to be relatively unambiguous, well-specified, and well-formed. In 
contrast, the dominant concern of NLG is choice. Generation systems must 
make the following choices: 

• Content selection — The system must choose the appropriate content 
to express from a potentially over-specified input, basing its decision 
on a specific communicative goal. For example, we noted that some of 
the content included in example 20.1 might not be appropriate for all 
readers. If the goal was to indicate that the environment is set up, and 
the reader was a systems engineer, then we’d probably express only the 
last clause. 

• Lexical selection — The system must choose the lexical item most 
appropriate for expressing particular concepts. In example 20.1, for 
instance, it must choose between the word “configured” and other po¬ 
tential forms including “set up”. 

• Sentence structure 

- Aggregation — The system must apportion the selected content 
into phrase, clause, and sentence-sized chunks. Example 20.1 
combined the actions of compiling and running into a single phrase. 

- Referring expressions — The system must determine how to re¬ 
fer to the objects being discussed. As we saw, the decision on 
how to refer to the program in example 20.1 was not trivial. 

• Discourse structure — NLG systems frequently deal with multi-sentence 
discourse, which must have a coherent, discernible structure. Exam¬ 
ple 20.1 included two propositions in which it was clear that one was 
giving evidence for the other. 

These issues of choice, taken together with the problem of actually putting 
linear sequences of words on paper, form the core of the field of NLG. 
Though it is a relatively young field, it has begun to develop a body of work 
directed at this core. This chapter will introduce this work. It will begin by 
presenting a simple architecture for NLG systems and will then proceed to 
discuss the techniques commonly used in the components of that architec¬ 
ture. 
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Figure 20.1 A reference architecture for NLG systems 


20.2 An Architecture eor Generation 


The nature of the architecture appropriate for accomplishing the tasks listed 
in the previous section has occasioned much debate. Practical considera¬ 
tions, however, have frequently led to the architecture shown in Figure 20.1, 

This architecture contains two pipelined components: 

* Discourse Planner - This component starts with a communicative [Ser se 
goal and makes all the choices discussed in the previous section. It 
selects the content from the knowledge base and then structures that 
content appropriately. The resulting discourse plan will specify all the 
choices made for the entire communication, potentially spanning mul¬ 
tiple sentences and including other annotations (including hypertext, 
figures, etc,). 

* Surface Realizer — This component receives the fully specified dis- peauz^r 
course plan and generates individual sentences as constrained by its 
lexical and grammatical resources. These resources define the real¬ 
ized potential range of output, if the plan specifies multiple-sentence 
output, the surface realizer is called multiple times. 
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20.3 


FUNCTION 

FORM 


This is by no means the only architecture that has been proposed for NLG 
systems. Other potential mechanisms include Al-style planning and black¬ 
board architectures. Neither is this architecture without its problems. The 
simple pipeline, for example, doesn’t allow decisions made in the planner to 
be reconsidered during surface realization. Furthermore, the precise bound¬ 
ary between planning and realization is not altogether clear. Nevertheless, 
we will use it to help organize this chapter. We’ll start by discussing the sur¬ 
face realizer, the most developed of the two components, and then proceed 
to the discourse planner. 


Surface Realization 

The surface realization component produces ordered sequences of words as 
constrained by the contents of a lexicon and grammar. It takes as input 
sentence-sized chunks of the discourse specification. This section will in¬ 
troduce two of the most influential approaches used for this task: Systemic 
Grammar and Functional Unification Grammar. Both of these approaches 
will be used to generate the following example: 

(20.2) The system will save the document. 

There is no general consensus as to the level at which the input to 
the surface realizer should be specified. Some approaches specify only the 
propositional content, so in the case of example 20.2, the discourse plan 
would specify a saving action done by a system entity to a document entity. 
Other approaches go so far as to include the specification of the grammatical 
form (in this case, a future tense assertion) and lexical items (in this case, 
“save”, “system”, and “document”). 

As we will see, systems using the two approaches discussed in this sec¬ 
tion take input at different levels. One thing they have in common, however, 
is that they take input that is functionally specified rather than syntactically 
specified. This fact, which is typical of generation systems, has tended to 
preclude the use of the syntactic formalisms discussed earlier in this book. 
Generation systems start with meaning and context, so it is most natural to 
specify the intended output in terms of function rather than of form. Ex¬ 
ample 20.2, for instance, could be stated in either active or passive form. 
Discourse planners tend not to work with these syntactic terms. They are 
more likely to keep track of the focus or local topic of the discourse, and 
thus it is more natural to specify this distinction in terms of focus. So in 
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the example, if the document is the local topic of the discourse, it would be 
marked as the focus which could trigger the use of the passive. As we will 
see, both of the approaches discussed here categorize grammar in functional 
terms. 


Systemic Grammar 

Systemic grammar is part of Systemic-Functional linguistics, a branch of functional 
linguistics that views language as a resource for expressing meaning in con¬ 
text (Halliday, 1985b). Systemic grammars represent sentences as collec¬ 
tions of functions and maintain rules for mapping these functions onto ex¬ 
plicit grammatical forms. This approach is well-suited to generation and has 
thus been widely influential in NLG. This section will start with an example 
of systemic sentence analysis. It will then discuss a simple systemic gram¬ 
mar and apply it to the running example. 

Systemic sentence analyses organize the functions being expressed in 
multiple “layers”, as shown in this analysis of example 20.2: 

The system will save the document 

Mood 
Transitivity 
Theme 



Here, the mood layer indicates a simple declarative structure with subject, fi¬ 
nite (auxiliary), predicator (verb) and object. The transitivity layer indicates 
that the “system” is the actor, or doer, of the process of “saving”, and that the 
goal, or object acted upon, is the “document”. 3 The theme layer indicates 
that the “system” is the theme, or focus of attention, of the sentence. 4 Notice 
that the three layers deal with different sets of functions. These three sets, 
called meta-functions, represent three fundamental concerns in generation: 

• The interpersonal meta-function groups those functions that estab¬ 
lish and maintain the interaction between the writer and the reader. It 
is represented here by the mood layer, which determines whether the 
writer is commanding, telling, or asking. 

• The ideational meta-function is concerned with what is commonly 

3 These thematic roles are discussed in Chapter 16. 

4 The concepts of theme and rheme were developed by the Prague school of linguistics 


META¬ 

FUNCTIONS 

INTERPER¬ 

SONAL 

META¬ 

FUNCTION 


IDEATIONAL 

META¬ 

FUNCTION 
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called the “propositional content” of the expression. Here, the transi¬ 
tivity layer determines the nature of the process being expressed and 
the variety of case roles that must be expressed. Note that this meta¬ 
function covers much of what is commonly termed “semantics”. 

meta ual • The textual meta-function is concerned with the way in which the ex- 

FUNCTION J 

pression fits into the current discourse. This includes issues of thema- 
tization and reference. In our example, the theme layer represents this 
in that it explicitly marks “the system” as the theme of the sentence. 


SYSTEM 

NETWORK 


REALIZATION 

STATEMENTS 


This explicit concern for interpersonal and textual issues as well as tradi¬ 
tional semantics is another feature of systemic linguistics that is attractive 
for NLG. Many of the choices that generation systems must make depend on 
the context of communication, which is formalized by the interpersonal and 
textual metafunctions. 

A systemic grammar is capable of building a sentence structure such 
as the one just shown. The grammar is represented using a directed, acyclic, 
and/or graph called a system network. Figure 20.2 illustrates a simple sys¬ 
tem network. Here, the large curly brace indicates “and” (i.e., parallel) sys¬ 
tems, while the straight vertical lines represent “or” (i.e., disjoint) systems. 
Thus, every clause (represented as the highest level feature on the far left) 
will simultaneously have a set of features for mood, transitivity and theme, 
but will either be indicative or imperative but not both. Although the sys¬ 
tem network formalism doesn’t require the use of systemic theory, we will 
loosely base this sample grammar on systemic categorizations. With respect 
to this grammar, example 20.2 is an indicative, declarative clause expressing 
an active material process with an unmarked theme. 

A systemic grammar uses realization statements to map from the 
features specified in the grammar (e.g., Indicative, Declarative) to syntac¬ 
tic form. Each feature in the network can have a set of realization statements 
specifying constraints on the final form of the expression. These are shown 
in Figure 20.2 as a set of italicized statements below each feature. Realiza¬ 
tion statements allow the grammar to constrain the structure of the expres¬ 
sion as the system network is traversed. They are specified using a simple 
set of operators shown here: 


+X Insert the function X. For example, the grammar in Figure 20.2 speci¬ 
fies that all clauses will have a predicator. 

X/Y Conflate the functions X and T. This allows the grammar to build a 


(Firbas, 1966). 
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Mood 


Clause 

+predicator 
predicator / ver9\ 


Transitivit y 


Indicative 

+subject 


Indicative 


Type 


subject > predicator 
+fin de¬ 
finite > predicator 
finite / auxiliary 
subject /noun phrase 

Imperative 

predicator / inf in itive 


Declarative 

subject > finite 


Interrogative 

finite > subject 


Interrogative 


Type 


Voice 


Material Process 

+goal 

+process 

process = finite, predicator 

Relational Process 


Wh- 

+question 
question > finite 
question / Wh- 

Polar 


Active 

+actor 

actor = subject 
+object 
object = goal 
predicator > object 
object/noun phrase 

Passive 

goal = subject 
finite /be 

predicator / past-participle 


Theme 


Unmarked Theme 

+theme +rheme 
theme — subject 
rheme = predicator, object 

Marked Theme 


Figure 20.2 A simple systemic grammar 


layered function structure by assigned different functions to the same 
portion of the expression. For example, active clauses conflate the 
actor with the subject, while passive clauses conflate the goal with the 
subject. 

AT Order function X somewhere before function Y. For example, indica¬ 
tive sentences place the subject before the predicator. 

X : A Classify the function X with the lexical or grammatical feature A. These 
classifications signal a recursive pass through the grammar at a lower 
level. The grammar would include other networks similar to the clause 
network that apply to phrases, lexical items, and morphology. As an 
example, note that the indicative feature inserts a subject function that 
must be a noun phrase. This phrase will be further specified by another 
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pass through the grammar. 

X\L Assign function X the lexical item L. In Figure 20.2, the finite element 
of the passive is assigned the lexical item “be”. 

Given a fully specified system network, the procedure for generation is to: 

1. Traverse the network from left to right, choosing the appropriate fea¬ 
tures and collecting the associated realization statements; 

2. Build an intermediate expression that reconciles the constraints set by 
the realization statements collected during this traversal; 

3. Recurse back through the grammar at a lower level for any function 
that is not fully specified; 

To illustrate this process, we will use the sample grammar to generate exam¬ 
ple 20.2 (“The system will save the document”). We will use the following 
specification as input: 5 

( 

:process 
:actor 
: goal 

:speechact 
:tense 
) 

Here, the save-1 knowledge base instance is identified as the process of 
the intended expression. We will assume all knowledge base objects to 
be KLONE-styled instances (Brachman, 1979) for which proper lexical en¬ 
tries exist. The actor and goal are similarly specified as system-1 and 
document-1 respectively. The input also specifies that the expression be 
in the form of an assertion in the future tense. 

The generation process starts with the clause feature in Figure 20.2, in¬ 
serting a predicator and classifying it as a verb. It then proceeds to the mood 
system. The correct option for a system is chosen by a simple query or de¬ 
cision network associated with that system. The query or decision network 
bases its decision on the relevant information from the input specification 
and from the knowledge base. In this case, the mood system chooses the 
indicative and declarative features because the input specifies an assertion. 

5 This input specification is loosely based on the spl-constructor interface to the PENMAN 
system (Mann, 1983), a systemic generation system. The Sentence Planning Language 
(SPL), a more flexible input language, is discussed in the bibliographical notes below. 


save-1 

system-1 

document-1 

assertion 

future 
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The realization statements associated with the indicative and declarative fea¬ 
tures will insert subject and finite functions, and order them as subject then 
finite then predicator. The resulting function structure would be as follows: 


Mood 


subject 


finite 


predicator 


We will assume that the save-1 action is marked as a material process 
in the knowledge base, which causes the transitivity system to choose the 
material process feature. This inserts the goal and process functions, and 
conflates the process with the finite/predicator pair. Because there is no indi¬ 
cation in either the input or the knowledge base to use a passive, the system 
chooses the active feature, which: (1) inserts the actor and conflates it with 
the subject, and (2) inserts the object, conflating it with the goal and ordering 
it after the predicator. This results in: 


Mood 

Transitivity 


subject 

finite 

predicator 

object 

actor 

process 

goal 


Finally, because there is no thematic specification in the input, the theme 
network chooses unmarked theme, which inserts theme and rheme, conflat¬ 
ing theme with subject and conflating rheme with the finite/predicator/object 
group. This results in the full function structure discussed above (repeated 
here): 


Mood 

Transitivity 

Theme 


At this point, the generation process recursively enters the grammar a num¬ 
ber of times at lower levels to fully specify the phrases, lexical items, and 
morphology. The noun phrase network will use a process like the one shown 
here to create “the system” and “the document”. Systems in the auxiliary 
network will insert the lexical item “will”. The choice of the lexical items 
“system”, “document”, and “save” can be handled in a number of ways, most 
typically by retrieving the lexical item associated with the relevant knowl¬ 
edge base instances. 


subject 

finite predicator 

object 

actor 

process 

goal 

theme 

rheme 
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Functional Unification Grammar 

Functional Unification Grammar uses unification (discussed in Chapter 11) 
to manipulate and reason about feature structures (Kay, 1979). With a few 
modifications, this technique can be applied to NLG. The basic idea is to 
build the generation grammar as a feature structure with lists of potential al¬ 
ternations, and then to unify this grammar with an input specification built 
using the same sort of feature structure. The unification process then takes 
the features specified in the input and reconciles them with those in the gram¬ 
mar, producing a full feature structure which can then be linearized to form 
sentence output. 

In this section we will illustrate this mechanism by generating exam¬ 
ple 20.2 again. We will use the simple functional unification grammar shown 
in Figure 20.3. This grammar, expressed as an attribute-value matrix (cf. 
Chapter 11), supports simple transitive sentences in present or future tense 
and enforces subject-verb agreement on number. WeTl now walk through 
the structure, explaining the features. 

At its highest level, this grammar provides alternatives for sentences 
(cat s), noun phrases (cat np) and verb phrases (cat vp). This alternation 
is specified with the alt feature on the far left. We use the curly braces to 
indicate that any one of the three enclosed alternatives may be followed. This 
level also specifies a pattern that indicates the order of the features specified 
at this level, in this case, actor, process, then goal. 

At the sentence level, this grammar supports actor, process, and goal 
features which are prespecified as NP, VP and NP respectively. Subject-verb 
agreement on number is enforced using the number feature inside the process 
feature. Here we see that the number of the process must unify with the path 
{actor number}. A path is a list of features specifying a path from the root to 
a particular feature. In this case, the number of the process must unify with 
the number of the actor. While this path is given explicitly, we can also have 
relative paths such as the number feature of the head feature of the NP. The 
path here, {t t number}, indicates that the number of the head of the noun 
phrase must unify with the number of the feature 2 levels up. WeTl see how 
this is useful in the example below. 

The VP level is similar in nature to the NP level except that it has its 
own alternation between present and future tense. Given the tense, which we 
will see specified in the input feature structure, the unification will select the 
alternation that matches and then proceed to unify the associated features. If 
the tense is present, for example, the head will be single verb. If, on the other 
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ALT < 


CAT 

ACTOR 

PROCESS 

GOAL 


CAT NPj 
CAT VP 

NUMBER |aCTOR NUMBERj 
CAT NpI 


PATTERN (ACTOR PROCESS GOAL) 


CAT 


HEAD 


NP 

CAT NOUN 

NUMB ER { 1 11 NUMB ER ! 


DETERMINER 

PATTERN 


jtt number| 


CAT ARTICLE 


(DETERMINER HEAD) 


CAT VP 


ALT < 


TENSE PRESENT 

CAT VERB 

NUMBER (tt NUMBER! 
PATTERN (HEAD) 


HEAD 


jtt NUMBER j 


TENSE 

AUXILIARY 


FUTURE 
CAT MODAL 
LEX “WILL” 


HEAD 


CAT VERB 

ENDING ROOT 

PATTERN (AUXILIARY HEAD) 


Figure 20.3 A simple FUF grammar. 
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hand, the tense is future, we will insert the modal auxiliary “will” before the 
head verb. 

This grammar is similar to the systemic grammar from the previous 
section in that it supports multiple levels that are entered recursively during 
the generation process. We now turn to the input feature structure, which 
specifies the details of the particular sentence we want to generate. The 
input structure, called a functional description (FD), is a feature structure 
just like the grammar. An FD for example 20.2 is as follows: 


CAT 

ACTOR 

PROCESS 

GOAL 


S 


HEAD |LEX SYSTEM 

HEAD [lEX SAVE 
TENSE FUTURE 


HEAD 


LEX DOCUMENT 


Here we see a sentence specification with a particular actor, the system, and 
a particular goal, the document. The process is the saving of the document 
by the system in the future. The input structure specifies the particular verbs 
and nouns to be used as well as the tense. This differs from the input to 
the systemic grammar. In the systemic grammar, the lexical items were re¬ 
trieved from the knowledge base entities associated with the actor and goal. 
The tense, though not included in the example systemic grammar, would 
be determined by a decision network that distinguishes the relative points 
in time relevant to the content of the expression. This unification grammar, 
therefore, requires that more decisions be made by the discourse planning 
component. 

To produce the output, this input is unified with the grammar shown in 
Figure 20.3. This requires multiple passes through the grammar. The pre¬ 
liminary unification unifies the input FD with the “S” level in the grammar 
(i.e., the first alternative at the top level). The result of this process is as 
follows: 
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PATTERN (ACTOR PROCESS GOAL) 


Here we see that the features specified in the input structure have been 
merged and unified with the features at the top level of the grammar. For 
example, the features associated with “actor” include the lexical item “sys¬ 
tem” from the input FD and the category “np” from the grammar. Similarly, 
the process feature combines the lexical item and tense from the input FD 
with the category and number features from the grammar. 

The generation mechanism now recursively enters the grammar for 
each of the sub-constituents. It enters the NP level twice, once for the actor 
and again for the goal, and it enters the YP level once for the process. The 
FD that results from this is shown in Figure 20.4. There we see that every 
constituent feature that is internally complex has a pattern specification, and 
that every simple constituent feature has a lexical specification. The system 
now uses the pattern specifications to linearize the output, producing “The 
system will save the document.” 

This particular example did not specify that the actor be plural. We 
could do this by adding the feature-value pair “number plural” to the actor 
structure in the input FD. Subject-verb agreement would then be enforced 
by the unification process. The grammar requires that number of the heads 
of the NP and the VP match with the number of the actor that was specified 
in the input FD. The details of this process are left as an exercise. 
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Figure 20.4 The fully unified FD 
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Summary 

The two surface generation grammars we’ve seen in this section illustrate the 
nature of computational grammars for generation. Both used functional cat¬ 
egorizations. One might wonder if it would be possible to use a single gram¬ 
mar for both generation and understanding. These grammars, called bidi- 

BIDIREC 

rectional grammars, are currently under investigation but have not found tional 

GRAMMj 

widespread use in NLG (cf. Chapter 21). This is largely due to the additional 
semantic and contextual information required as input to the generator. 

20.4 Discourse Planning 

The surface realization component discussed in the previous section takes 
a specified input and generates single sentences. Thus, it has little or no 
control over either the discourse structure in which the sentence resides or the 
content of the sentence itself. These things are controlled by the discourse 
planner. This section will introduce the two predominant mechanisms for 
building discourse structures: text schemata and rhetorical relations. 

The focus on discourse rather than just sentences has been a key fea¬ 
ture of much work done in NLG. Many applications require that the system 
produce multi-sentence or multi-utterance output. This can be done by sim¬ 
ply producing a sentence for each component of the intended meaning, but 
frequently more care is required in selecting and structuring the meaning in 
an appropriate way. For example, consider the following alternate revision 
of the “hello, world” output discussed in the introduction: 

(20.3) You’ve just compiled a simple C program. You’ve just run a simple 
C program. Your environment is configured properly. 

These sentences are fine in isolation, but the text is more disjointed than the 
one given in example 20.1 and is probably harder to understand. Although 
it orders the sentences in a helpful way, it doesn’t give any indication of the 
relationship between them. These are the sorts of issues that drive discourse 
planning. 

This section will also discuss the closely related problem of content 
selection, which, as we saw earlier, is the process of selecting propositional 
content from the input knowledge base based on a communicative goal. Be¬ 
cause the form of this knowledge base and the nature of the communicative 
goal varies widely from one application to another, it is difficult to make 
general statements about the content selection process. To make things 
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more concrete, therefore, this section will focus on the task of generating 
instructions for a simple word-processing application. We’ll assume that 
the knowledge base, whatever its underlying structure, can be viewed as a 
KLONE-styled knowledge base. We’ll also assume that the communicative 
goal is to explain the represented procedure to a new user of the system. 
The knowledge base will represent the procedure for saving a file as a sim¬ 
ple procedural hierarchy, as shown in Figure 20.5. The procedure specified 
there requires that the user choose the save option from the file menu, se¬ 
lect the appropriate folder and file name, and then click on the save button. 
As a side-effect, the system automatically displays and removes the save-as 
dialog box in response to the appropriate user actions. This representation 
gives the procedural relationships between the basic actions but it doesn’t 
show any of the domain knowledge concerning the structure of the interface 
(e.g., which choices are on which menus) or the particular entities that are 
used in the procedure (e.g., the document, the user). We’ll assume that these 
are accessible in the knowledge base as well. 

Text Schemata 

Apart from the rigidly structured canned texts and slot-filler templates dis¬ 
cussed in the opening of this chapter, the simplest way to build texts is to 
key the text structure to the structure of the input knowledge base. For ex¬ 
ample, we might choose to describe a game of tic-tac-toe or checkers by 
reviewing the moves in the sequence in which they were taken. This strategy 
soon breaks down, however, when we have a large amount of information 
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that could potentially be expressed in order to achieve a variety of commu¬ 
nicative goals. The knowledge base that contains the fragment shown in 
Figure 20.5, for example, could be expressed as a sequence of instructions 
such as one might find in a tutorial manual, or it could be expressed as an 
alphabetized set of program functions such as one might find in a reference 
manual. 

One approach to this problem rests on the observation that texts tend 
to follow consistent structural patterns. For example, written directions ex¬ 
plaining how to carry out an activity typically express the required actions 
in the order of their execution. Any preconditions of these actions are men¬ 
tioned before the appropriate action. Similarly, side-effects of these actions 
are mentioned after the appropriate action. In some domains, patterns such 
as these are rarely broken. Armed with this information, we can build a 
schema representing this structure, such as the one shown in Figure 20.6. schema 
This schema is represented as an augmented transition network (ATN) in ^ranItion 
which each node is a state and each arc is an optional transition (see Chap¬ 
ter 10). Control starts in the small black node in the upper left and proceeds 
to follow arcs as appropriate until execution stops in the terminal node of the 
lower left. Node SO allows the expression of any number of preconditions. 
Transitioning to SI forces the expression of the action itself. SI allows re¬ 
cursive calls to the network to express any sub-steps. The transition to S2 
requires no action, and S2 allows any number of side-effects to be expressed 
before halting execution. 

We can use this schema to plan the expression of the example proce¬ 
dure shown in Figure 20.5. When the system is asked to describe how to 
save a document, the procedure schema can be activated. WeTl assume that 
the knowledge base specifies no preconditions for the action of saving a file, 
so we proceed directly to state SI, forcing the expression of the main action: 

“Save the document”. In state S2, we recursively call the network for each 
of the four sub-steps specified in the input. This expresses the first sub-step, 

“choose the save option”, along with its side-effect, “this causes the system 
to display the save-as dialog box”. The first sub-step has no preconditions 
or sub-steps. Each of the other sub-steps is done in the same manner and 
execution finally returns to the main action execution in step S2 which ex¬ 
presses the result of the whole process, “this causes the system to save the 
document” and then terminates. Depending on the details of the planning, 
the final text might be as follows: 


Save the document: First, choose the save option from the 
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file menu. This causes the system to display the Save-As dialog 
box. Next, choose the destination folder and type the filename. 
Finally, press the save button. This causes the system to save the 
document. 


Each one of these sentences can be generated using one of the surface realiz¬ 
es discussed in the previous section. As we can see, the schema mechanism 
is more flexible than templates or canned text. It structures the output accord¬ 
ing to known patterns of expression, but, with appropriate constraints, is able 
to insert optional material collected from the knowledge base in a variety of 
orders. In addition, it is not required to express everything in the knowledge 
base; the side-effect of the “click save button” action, for example, was not 
included. 

This schema mechanism produced only a high-level discourse struc¬ 
ture. The problem of specifying of the detailed form of each of the sentences, 
commonly called microplanning, is discussed in Section 20.5. 
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Rhetorical Relations 

Schemata are useful for discourse planning provided a discrete set of consis¬ 
tent patterns of expression can be found and encoded. However, they suffer 
from two basic problems. First, they become impractical when the text be¬ 
ing generated requires more structural variety and richness of expression. 

For example, we may find that certain conditions dictate that we format our 
procedural instructions in a different manner. Some contexts may dictate that 
we explicitly enumerate the steps in the procedure, or that we express cer¬ 
tain segments of the text in a different manner or in a different order. While 
in principle these variations could be supported either by adding constraints 
and operational code to the schema or by adding new schemata, the more 
variations that are required, the more difficult the schema-based approach 
becomes. 

The second problem with schema-based mechanisms is that the dis¬ 
course structure they produce is a simple sequence of sentence generation 
requests. It includes no higher-level structure relating the sentences together. 

In some domains, particularly in interactive ones (cf. Chapter 19), the struc¬ 
ture of the previous discourse is relevant for future planning. For example, 
if we have explained a process in some detail, we might not want to do it 
again. It’s easier to do these things when there is a record of the structure of 
previous discourse. 

A useful approach here is to take a look under the hood of the schema in 
order to discover the more fundamental rhetorical dynamics at work in a text. 

A system informed by these dynamics could develop its own schemata based 
on the situations it confronts. A number of theories that attempt to formalize 
these rhetorical dynamics have been proposed, as discussed in some detail 
in Chapter 18. One such theory, Rhetorical Structure Theory (RST), is a structure 
descriptive theory of text organization based on the relationships that hold 
between parts of the text (Mann and Thompson, 1987b). As an example, 
consider the following two texts: 

(20.4) I love to collect classic automobiles. My favorite car is my 1899 

Duryea. 

(20.5) I love to collect classic automobiles. My favorite car is my 1999 

Toyota. 

The first text makes sense. The fact that the writer likes the 1899 Duryea 
follows naturally from the fact that they like classic automobiles. The sec¬ 
ond text, however, is problematic. The problem is not with the individual 
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sentences, they work perfectly well in isolation. Rather, the problem is with 
their combination. The fact that the two sentences are in sequence implies 
that there is some coherent relationship between them. In the case of the 
first text, that relationship could be characterized as one of elaboration (cf. 
Chapter 19). The second text could be characterized as one of contrast and 
would thus be more appropriately expressed as: 

(20.6) I love to collect classic automobiles. However, my favorite car is my 
1999 Toyota. 

Here, the “however”, overtly signals the contrast relation to the reader. RST 
claims that an inventory of 23 rhetorical relations, including ELABORATION 
and contrast, is sufficient to describe the rhetorical structure a wide vari¬ 
ety of texts. In practice, analysts tend to make use of a subset of the relations 
that are appropriate for their domain of application. 

Most RST relations designate a central segment of text (“I love to col- 
nucleus lect... ”), called the nucleus, and a more peripheral segment (“My favorite 

satellite car is... ”), called the satellite. This encodes the fact that many rhetorical re¬ 

lations are asymmetric. Here the second text is being interpreted in terms of 
the first, and not vice-versa. As we will see below, not all rhetorical relations 
are asymmetric. RST relations are defined in terms of the constraints they 
place on the nucleus, on the satellite, and on the combination of the nucleus 
and satellite. Here are definitions of some common RST relations: 

elaboration — The satellite presents some additional detail concerning 
the content of the nucleus. This detail may be of many forms: 

• a member of a given set 

• an instance of a given abstract class 

• a part of a given whole 

• a step of a given process 

• an attribute of a given object 

• a specific instance of a given generalization 

contrast — The nuclei present things that, while similar in some re- 
nuclear spects, are different in some relevant way. This relation is multi-nuclear in 
that it doesn’t distinguish between a nucleus and a satellite. 

condition — The satellite presents something that must occur before the 
situation presented in the nucleus can occur. 

purpose — The satellite presents the goal of performing the activity pre¬ 
sented in the nucleus. 
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sequence — This relation is multi-nuclear. The set of nuclei are realized 
in succession. 

result — The situation presented in the nucleus results from the one pre¬ 
sented in the satellite. 

RST relations are typically graphed as follows: 



classic automobiles, is my 1899 Duryea. 

Here we see a graphical representation of the rhetorical relation from exam¬ 
ple 20.4. The segments of text are ordered sequentially along the bottom of 
the diagram with the rhetorical relations built above them. The individual 
text segments are usually clauses. 

Rhetorical structure analyses are built up hierarchically, so we may 
use one pair of related texts as a satellite or nucleus in another higher-level 
relation. Consider the following three-sentence structure: 



Here we see that the first two clauses are related to one another via an elabo¬ 
ration relationship, and are related, as a pair, to the third clause via a contrast 
relationship. Note also how the multi-nuclear contrast relation is depicted. 
Recursive structuring such as this allows RST to build a single analysis tree 
for extended texts. 

Although RST was originally proposed as a descriptive tool, it can also 
be used as a constructive tool for NLG. In order to do this, the rhetorical 
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relations are typically recast as operators for an Al-style planner. As an 
example of this, we will look at a general-purpose, top-down, hierarchical 
planner that can be used for rhetorically-based text planning. 6 

The basic approach with this sort of planner is for the generation sys¬ 
tem to post a high level communicative goal stated in terms of the effect that 
the text should have on the reader. For our instructional text example, we will 
request that the planner build a structure to achieve the goal of making the 
reader competent to save a file. The highest level plan operator that achieves 
this goal will insert a rhetorical node appropriate for the goal and insert sub¬ 
goals for the nucleus and satellite of that rhetorical relation. These sub-goals 
will then be recursively expanded until the planning process reaches the bot¬ 
tom of the rhetorical structure tree, inserting a node that can be expressed as 
a simple clause. 

For our example, we would post the goal: 

(COMPETENT hearer (DO-ACTION <some-action>)) 

Here, the communcative goal is to make the hearer competent to do some 
action. The action would be represented as an instance in the knowledge 
base, in this case, as the root node from the procedural hierarchy shown in 
Figure 20.5. A text plan operator that would fire for this goal would be as 
follows: 

Name: Expand Purpose 
Effect: 

(COMPETENT hearer (DO-ACTION ?action)) 

Constraints: 

(AND 

(c-get-all-substeps ?action ?sub-actions) 

(NOT (singular-list? ?sub-actions)) 

Nucleus: 

(COMPETENT hearer (DO-SEQUENCE ?sub-actions)) 

Satellites: 

(((RST-PURPOSE (INFORM s hearer (DO ?action))) 

Required*)) 


The basic idea of this plan operator is to explain how to do a particular action 
(“?action”) by explaining how to do its substeps (“?substeps”)- Note that the 
effect field matches the goal we posted earlier. An operator is applicable 

6 This text planner is adapted from the work of Moore and Paris (1993). 
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when its constraints hold. In this case, the main action (“?action”) must have 
more than one sub-action. Because this is true in the current example (see 
Figure 20.5), the operator inserts a rhetorical purpose node into the discourse 
structure along with the goal specifications for its satellite and nucleus. The 
satellite informs the hearer of the purpose of performing the main action, 
and the nucleus lists the sub-actions required to achieve this goal. Note that 
the effect, constraints, nucleus and satellite fields of the operator make use 
of variables (identifiers starting with “?”) that are unified when the operator 
is applied. Thus, the goal action is bound to “?action” and can be accessed 
throughout the rest of the plan operator. 

One other thing to notice about the plan operator is the way in which 
content selection is done. The constraint field specifies that there must be 
substeps and that there must be more than one of them. Determining whether 
the first constraint holds requires that the system retrieve the sub-steps from 
the knowledge base. These sub-steps are then used as the content of the 
nucleus node that is constructed. Thus, the plan operators themselves do the 
content selection as required by the discourse planning process. 

The full text structure produced by the planner is shown in Figure 20.7. 
The root node of this tree (i.e., the horizontal line at the very top) is the node 
produced by the previous plan operator. The first nucleus node in Figure 20.7 
is the multi-nuclear node comprising all the sub-actions. The plan operator 
that produces this node is as follows: 


Name: Expand Sub-Actions 
Effect: 

(COMPETENT hearer (DO-SEQUENCE ?actions)) 

Constraints: 

NIL 

Nucleus: 

(foreach ?actions (RST-SEQUENCE 

(COMPETENT hearer (DO-ACTION ?actions)))) 

Satellites: 

NIL 


This operator achieves the nucleus goal posted by the previous operator. It 
posts a rhetorical node with multiple nuclei, one for each sub-action required 
to achieve the main goal. With an appropriate set of plan operators, this 
planning system can produce the discourse structure shown in Figure 20.7, 
which could then be linearized into the following text: 
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Choose Show Click Save 

Save Save-as Save Document 

Dialog box 

Figure 20.7 The full rhetorical structure for the example text 


To save a new file 

1. Choose the save option from the file menu. 

The system will display the save-file dialog box. 

2. Choose the folder. 

3. Type the file name. 

4. Click the save button. 

The system will save the document. 

All of these sentences can be generated by a surface realizer. The last 
one, in particular, was identified as example 20.2 in the previous sections. As 
mentioned in the section on schema-based discourse planning, the problem 
of microplanning has been deferred to Section 20.5. 

Summary 

In this section, we have seen how schema-based mechanisms can take ad¬ 
vantage of consistent patterns of discourse structure. Although this approach 
has proven effective in the many contexts, it is not flexible enough to handle 
more varied generation tasks. Discourse planning based on rhetorical rela¬ 
tions was introduced to add the flexibility required to handle these sorts of 
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tasks. 


20.5 Other Issues 


This section introduces issues that were not discussed in detail in the previ¬ 
ous sections. 


Microplanning 


The previous sections did not detail the process of mapping from the dis¬ 
course plans described in the examples to the inputs to the surface realizers. 
The discourse structures, such as the one shown in Figure 20.7, specified 
the high-level or macro structure of the text, but few of the details expected 
as input to the surface realizers. The problem of doing this more detailed 
planning is called microplanning. 

In most generation applications, microplanning is simply hard-wired. 
For example, in instruction generation systems, objects can be referred to in 
the same way in all cases, and user actions can be expressed as separate im¬ 
perative sentences. This greatly simplifies the problem, but tends to produce 
monotonous texts such as the one shown in example 20.3. This illustrates 
two of the primary areas of concern in microplanning: referring expres¬ 
sions and aggregation. 

Planning a referring expression requires that we determine those as¬ 
pects of an entity that should be used when referring to that entity in a par¬ 
ticular context. If the object is the focus of discussion and has just been 
mentioned, we might be able to use a simple “it”, whereas introducing a new 
entity may require more elaborate expressions like “a new document to hold 
your term paper”. These issues are discussed in some detail in Chapter 18. 

Aggregation is the problem of apportioning the content from the knowl¬ 
edge base into phrase, clause, and sentence-sized chunks. We saw an exam¬ 
ple of this in the introduction where two of the actions mentioned in exam¬ 
ple 20.1 were conjoined within the first clause as “you’ve just compiled and 
run a simple C program”. This is more readable than the non-aggregated 
version given in example 20.3 (“You’ve just compiled a simple C program. 
You’ve just run a simple C program”). 

Microplanning is frequently seen as an intermediate pipelined mod¬ 
ule placed between the discourse planner and the surface realizer (see Fig¬ 
ure 20.1) (Reiter and Dale, 2000). Indeed, more recent work has emphasized 
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microplanning to the point that it is viewed as a task of importance equal to 
that of discourse planning and surface realization. It is also possible to add 
planning operators to the RST-based planning mechanism described in the 
chapter in order to perform microplanning tasks. However the microplan¬ 
ning is done, it serves to map from the output of the discourse planner to the 
input of the surface realizer. 

Lexical Selection 

Lexical selection refers to the general problem of choosing the appropriate 
words with which to express the chosen content. The surface realizers dis¬ 
cussed in this chapter explicitly inserted closed-class lexical items as they 
were required, but deferred the choice of the content words to the discourse 
planner. Many planners simplify this issue by associating a single lexical 
item with each entity in the knowledge base. 

Handling lexical selection in a principled way requires that the gener¬ 
ation system deal with two issues. First, it must be able to choose the appro¬ 
priate lexical item when more than one alternative exists. In the document¬ 
saving text from the previous section, for instance, the system generated 
“Click the save button”. There are alternatives to the lexical item “click”, 
including “hit” and “press mouse left on”. The choice between these alterna¬ 
tives could consider: (1) style — in this case “hit” is perhaps more informal 
that “click”, (2) collocation — in this case “click” probably co-occurs with 
buttons more often in this domain, and (3) user knowledge — in this case a 
novice computer user might need the more fully specified “press mouse left 
on”. 

Second, the generation system must be able to choose the appropriate 
grammatical form for the expression of the concept. For example, the system 
could title the section “Saving a new file” rather than “To save a new file”. 
This choice between the participle and the infinitive form is frequently made 
based on the forms most commonly employed in a corpus of instructions. 

Evaluating Generation Systems 

In early work on NLG, the quality of the output of the system was assessed 
by the system builders themselves. If the output sounded good, then the sys¬ 
tem was judged a success. Because this is not a very effective test of system 
quality, much recent interest has been focussed on the rigorous evaluation of 
NLG systems. Several techniques have emerged. 
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One technique is to statistically compare the output of the generator 
with the characteristics of a corpus of target text. If the form chosen by 
the generator matches the form most commonly used in the corpus, it is 
judged as correct. The danger with this approach is that the corpus is usually 
produced by writers that may make errors, thus skewing the corpus statistics. 

The assumption is that, as Tolstoy put it (Tolstoy, 1977), “All happy families 
are alike, but an unhappy family is unhappy after its own fashion.” In other 
words, good text displays a consistent set of characteristics that arise again 
and again, while bad text displays idiosycratic characteristics that will not 
accumulate statistically. 

Another technique is to convene a panel of experts to judge the output 
of the generator in comparison with text produced by human authors. In this 
variation of the Turing test, the judges do not know which texts were gen¬ 
erated by the system and which were written by human authors. Computer 
generated text typically scores lower than human written text, but its quality 
approaches that of human authors in some restricted domains. 

A final technique is to judge how effective the generated text is at 
achieving its goal. For example, if the text is intended to describe some 
object, its quality can be measured in terms of how well readers score on 
a content quiz given after reading the output text. If the text is intended to 
explain how to perform some process, its quality can be measured in terms 
of the number of procedural errors made by the reader after reading the text. 

Generating Speech 

This chapter has focussed on generating text rather than on generating speech. 

There are, however, many situations in which speech output is preferable if 
not absolutely necessary. These include situations where there is no textual 
display, such as when the user is using a telephone, and situations where the 
users are unable to look at a textual display, such as when the user is driving 
or when the user is disabled. 

A simplistic approach might be to pass the word string that is produced 
by a generation system to a text-to-speech synthesizer of the sort described 
in Chapter 4, Chapter 5, and Chapter 7. One problem with this approach was 
already discussed on page 120 and page 601: text-to-speech systems must 
then deal with homographs (i.e., words with the same spelling but different graphs 
pronunciations). Consider the following example: 

(20.7) Articulate people can clearly articulate the issues. 

Here, the two instances of the spelling “articulate” must be pronounced dif- 
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ferently. Another problem is the treatment of prosody, which requires that 
appropriate pitch contours and stress patterns be assigned to the speech being 
produced. 

The simplistic approach requires the text-to-speech system to solve 
both of these problems by analyzing the input text. Homographs can fre¬ 
quently be distinguished using part-of-speech tagging (the adjective and verb 
forms of “articulate” are pronounced differently) or by the word-sense dis¬ 
ambiguation algorithms of Chapter 17. As Chapter 4 (page ??) suggests, 
automatic generation of prosody is a much harder problem. Some prosodic 
information can be deduced by distinguishing questions from non-questions, 
and by looking for commas and periods. In general, however, it is not easy 
to extract the required information from the input text. 

An alternative to the simplistic approach is to pass a richer representa¬ 
tion from the NLG system to the speech synthesizer. A typical NLG system 
knows the semantics and part of speech of the word it intends to generate, 
and can annotate the word with this information to help select the proper 
word pronunciation. The system could also annotate the output with dis¬ 
course structure information to help synthesize the proper prosody. To date, 
there has been very little work on this area in NLG. 


20.6 Summary 

Language Generation is the process of constructing natural language outputs 
from non-linguistic inputs. As a field of study, it usually does not include the 
study of simpler generation mechanisms such as canned text and template 
filling. 

• Language generation differs from language understanding in that it fo¬ 
cuses on linguistic choice rather than on resolving ambiguity. Issues 
of choice in generation include content selection, lexical selection, 
aggregation, referring expression generation, and discourse struc¬ 
turing. 

• Language generation systems include a component that plans the struc¬ 
ture of the discourse, called a discourse planner, and one that gener¬ 
ates single sentences, called a surface realizer. Approaches for dis¬ 
course planning include text schemata and rhetorical relation plan¬ 
ning. Approaches for surface realization include Systemic Grammar 
and Functional Unification Grammar. 
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• Microplanners map the discourse planner output to the surface gener¬ 
ator input, which includes the fine-grained tasks of referring expres¬ 
sion generation, aggregation, and lexical selection. 


Bibliographical and Historical Notes 

Excluding canned text and template filling mechanisms, natural language 
generation is a young field relative to the rest of language processing. Some 
minor forays into the field occurred in the 50’s and 60’s, mostly in the con¬ 
text of machine translation, but work focusing on generation didn’t arise until 
the 70’s. Simmons and Slocum’s system (1972) used ATN’s to generate dis¬ 
course from semantic networks, Goldman’s babel (1975) used decision net¬ 
works to perform lexical choice, and Davey’s PROTEUS (1979) produced de¬ 
scriptions of tic-tac-toe games. The 80’s saw the establishment of generation 
as a distinct field of research. Influential contributions on surface realization 
were made by McDonald (1980) and the penman project (Mann, 1983), and 
on text planning by McKeown (1985) and Appelt (1985). The 90’s have seen 
continuing interest with the rise of generation-focussed workshops, both Eu¬ 
ropean and international, and organizations (cf. the Special Interest Group 
on language GENeration, http://www.aclweb.org/siggen). Kukich (1988) 
and Reiter and Dale (2000) have discussed the uses and limitations of canned 
text and template mechanisms. 

As of this writing, no textbooks on generation exist. However, a text 
on applied generation is in press (Reiter and Dale, 2000), and a number of 
survey papers have been written (Dale et al , 1998a; Uszkoreit, 1996; Mc¬ 
Donald, 1992; Bateman andHovy, 1992; McKeown and Swartout, 1988). A 
number of these references discuss the history of NLG and its relationship to 
the rest of language processing. McDonald (1992) introduces the distinction 
between hypothesis management and choice. 

Generation architectures have typically pipelined the tasks of planning 
and realization. The pipelining is used to constrain the search space within 
each of the modules and thus to make the generation task more tractable (Re¬ 
iter and Dale, 2000; McDonald, 1988; Thompson, 1977). However, these 
architectures have the well-known problem that decisions made by the dis¬ 
course planner cannot easily be undone by the realizer (Meteer, 1992). Ap- 
pelt’s kamp (1985) employed a unified architecture for planning and realiza¬ 
tion based on AI planning. This approach, however, has proven computation- 
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ally impractical in larger domains. Blackboard architectures have also been 
proposed for language generation systems (Nirenburg et al , 1989). The var¬ 
ious concerns of microplanning itself have been the subject of considerable 
interest, including work on referring expressions (Dale, 1992; Appelt, 1985), 
aggregation (Dalianis, 1999; Mann and Moore, 1981), and other grammat¬ 
ical issues (Vander Linden and Martin, 1995; Meteer, 1992). The related 
issues of lexical selection (Stede, 1998; Reiter, 1990; Goldman, 1975) and 
tailoring the output text to particular audiences (Paris, 1993; Hovy, 1988a) 
have also received attention. 

The late 80’s and early 90’s saw the construction of several reusable 
NLG systems, including two that have been distributed publicly: kpml 
(Bateman, 1997) and fuf (Elhadad, 1993). These tools can be downloaded 
through the SIGGEN web site. Most of this work was done in Lisp, but 
recent efforts have been made to port the systems to other languages and 
platforms. 

Systemic functional linguistics (SFL) was developed by Halliday (1985b). 
It has remained largely independent of generative linguistics and is relatively 
unknown in the language processing community as a whole. Attempts to 
use it in parsing have had limited success (O’Donnell, 1994; Kasper, 1988). 
However, it has had a deep impact on NLG, being used in one form or an¬ 
other by a number of generation systems, including Winograd’s shrdlu 
(1972b), Davey’s proteus, Patten’s slang (1988), penman (Mann, 1983), 
fuf (Elhadad, 1993) and ilex (Dale et al , 1998b). The example systemic 
grammar in this chapter is based in part on Winograd’s discussion (1972b). 
SFL’s most complete computational implementation is the Komet-Penman 
MultiLingual development environment (kpml), which is a descendent of 
PENMAN. KPML is packaged with NIGEL, a large English generation gram¬ 
mar, as well as an environment for developing multilingual grammars. It 
also includes a Sentence Planning Language (spl) that forms a more usable 
interface to the systemic grammar itself, spl specifications are considerably 
simpler to build than specifications that must include all the information re¬ 
quired to make all the choices in the system network, but are more flexible 
that the spl-constructor example given in the chapter. Consider the following 
spl specification: 


(si / save 




:actor 

(al 

/ 

system 



: 

determiner the) 

:actee 

(a2 

/ 

document 
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:determiner the) 

:tense future 

) 

The SPL interpreter will expand this into the series of feature choices re¬ 
quired for the Nigel grammar to generate example 20.2 (“The system will 
save the document.”). Each term in this specification gives the role of the 
entity (e.g., actor, actee) as well as the semantic type (e.g., save, system, 
document). The semantic types are KLONE-styled concepts subordinated to 
a general ontology (cf. Chapter 16) of concepts called the upper model ^del 
(Bateman et al , 1990). This ontology, which represents semantic distinc¬ 
tions that have grammatical consequences, is used by SPL to determine the 
type of entity being expressed and thus to reduce the amount of information 
explicitly contained in the SPL specification. This example leaves out the 
: speechact assertion term included in the example in the chapter 
because SPL uses this as a default value if left unspecified. 

Functional Unification Grammar was developed by Kay (1979), see 
Chapter 11. Its most influential implementation for generation is the Func¬ 
tional Unification Formalism (pul) developed by Elhadad (Elhadad, 1993, 

1992). It is distributed with the English grammar surge. Although the 
example given in the chapter used a simple phrase-structure approach to 
grammatical categorization (cf. (Elhadad, 1992)), the surge grammar uses 
systemic categorizations. 

Another linguistic theory that has been influential in language gener¬ 
ation is Mel’cuk’s Meaning Text Theory (mtt) (1988). mtt postulates a 
number of levels ranging from deep syntax all the way to surface structure. 
Surface realizers that use it, including CoGenTex’s REALPRO (Lavoie and 
Rambow, 1997) and ERLTs AlethGen (Coch, 1996b), start with the deep 
levels and map from level to level until they reach the surface level. 

Discourse generation has been a concern of NLG from the beginning. 
Davey’s proteus, for example, produced paragraph-length summaries of 
tic-tac-toe games. His system structured its output based heavily upon the 
structure of the trace of the game which the application system recorded. 
Schema-based text structuring, pioneered by McKeown (1985), is more flex¬ 
ible and has been used in a number of applications (Milosavljevic, 1997; 

Paris, 1993; McCoy, 1985). The schema-based example presented in this 
chapter is based on the COMET instruction generation system (McKeown 
etal , 1990). Although other theories of discourse structure (cf. Chapter 18) 
have influenced NLG, including theories by Grosz and Sidner (1986), Hobbs 
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(1979a), and Kamp’s drt (1981), Rhetorical Structure Theory (RST), devel¬ 
oped by Mann and Thompson (1987b), has had the most influence (Marcu, 
1998; Scott and Souza, 1990; Hovy, 1988b). The classic automobile exam¬ 
ple in ths chapter is adapted from Mann and Thompson (Mann and Thomp¬ 
son, 1986), and the RST-based planning example is based on Moore and 
Paris’ text planner (Moore and Paris, 1993) as it was used in the drafter 
(Paris and Vander Linden, 1996; Paris et ah, 1995), ISOLDE (Paris et ah, 
1998) and wip (Wahlster et ah , 1993) projects. The use of this planner in 
the context of an interactive dialog system is described by Moore and Paris 
(1993). A more recent alternative to this approach has been developed by 
Marcu (1998). 

Applications of NLG tend to focus on relatively restricted sublanguages 
(cf. Chapter 21), including weather reports (Coch, 1998; Goldberg et ah, 
1994), instructions (Paris et ah, 1998; Paris and Vander Linden, 1996; Wahlster 
etah , 1993), encyclopedia-like descriptions (Milosavljevic, 1997; Dale et ah, 
1998b), and letters (Reiter et ah, 1999). The output can be delivered as sim¬ 
ple text or hypertext (Lavoie et ah, 1997; Paris and Vander Linden, 1996), 
dynamically generated hypertext (Dale et ah, 1998b), multimedia presen¬ 
tation (Wahlster et ah, 1993), and speech (Van Deemter and Odijk, 1997). 
Information on a number of these systems is available on-line at the SIGGEN 
web site. 

The evaluation of NLG systems has received much recent attention. 
Evaluations have assessed the similarity of the output with a representative 
corpus (Yeh and Mellish, 1997; Vander Linden and Martin, 1995), convened 
panels of experts to review the text (Lester and Porter, 1997; Coch, 1996a), 
and tested how effective the text was at achieving its communicative purpose 
(Reiter et ah, 1999). It is also becoming more common for the usability of 
the NLG system itself to be evaluated. 

Other issues of interest in NLG include the use of connectionist and 
statistical techniques (Langkilde and Knight, 1998; Ward, 1994), and the 
viability of multilingual generation as an alternative to machine translation 
(Hartley and Paris, 1997; Goldberg et ah, 1994). 


Exercises 

20.1 Use the systemic grammar given in the chapter to build a multiple- 
layer analysis of the following sentences: 
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a. The document will be saved by the system. 

b. Will the document be saved by the system? 

c. Save the document. 

20.2 Extend the systemic grammar given in the chapter to handle the fol¬ 
lowing sentences: 

a. The document is large, (a “relational process”) 

b. Give the document to Mary. 

c. Is the document saved? (a “polar interrogative”) 

20.3 Use the FUF grammar given in the chapter to build a fully unified FD 
for the following sentences: 

a. The system saves the document. 

b. The systems save the document. 

c. The system saves the documents. 

20.4 Extend the FUF grammar given in the chapter to handle the following 
sentences: 

a. The document will be saved by the system, (i.e., the passive) 

b. Will the document be saved by the system? (i.e., wh- questions) 

c. Save the document, (i.e., imperative commands) 

20.5 Select a restricted sublanguage (cf. Chapter 21) and build either a sys¬ 
temic or fuf generation grammar for it. The sublanguage should be subset 
of a restricted domain such as weather reports, instructions, or responses to 
simple inquires. As a test, you can download either fuf or kpml, whichever 
is appropriate, and implement your grammar. Both systems can be found 
through the SIGGEN web site. (Note that it is much easier to build test 
grammars with FUF than with KPML.) 

20.6 Compare and contrast the SPL input to kpml (discussed in the biblio¬ 
graphical and historical notes) and the fd input to fuf. What decisions are 
required of the discourse planner for each of them? What are their relative 
strengths and weaknesses? 

20.7 (Adapted from McKeown (1985)) Build an ATN appropriate for struc¬ 
turing a typical encyclopedia entry. Would it be in any way different from 
an ATN for a dictionary entry, and if so, could you adapt the same ATN for 
both purposes? 
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20.8 (Adapted from Bateman (1997)) Build a system network for using 
“dr”, “mr”, “ms”, “mrs”, “miss” in expressions like “Miss. Jones” and “Mr. 
Smith”. What information would the knowledge base need to contain to 
make the appropriate choices in your network? 

20.9 Do an RST analysis for the following text: 

Temperature Adjustment 

Before you begin, be sure that you have administrator access 
to the system. If you do, you can perform the following steps: 

a. From the EMPLOYEE menu select the Adjust Temperature 
item. The system displays the Adjust Temperature dialog 
box. 

b. Select the room. You may either type the room number or 
click on the appropriate room’s icon. 

c. Set the temperature. In general you shouldn’t change the 
temperature too drastically. 

d. Click the ok button. The system sets the room temperature. 

By entering a desired temperature, you are pretending that 
you just adjusted the thermostat of the room that you are in. 

The chapter lists a subset of the RST relations. Does it give you all the 
relations you need? How do you think your analysis would compare with 
the analyses produced by other analysts? 

20.10 How does RST compare with Grosz and Sidner’s theory of discourse 
presented in Chapter 18? Does one encompass the other or do they address 
different issues? Why do you think that RST has had a greater influence on 
NLG? 

20.11 Would RST be useful for interactive dialog? If so, how would you 
use it? If not, what changes would you make to get it to work 

20.12 (Adapted from ISOLDE (Paris et al , 1998)) Speculate on how you 
would enhance an RST-based discourse planner to plan multi-modal dis¬ 
course, which would include diagrams and formatting (such as html for¬ 
matting). 

20.13 (Adapted from stop (Reiter et al , 1999)). This chapter did not 
discuss template generators in any detail, it simply mentioned that they are 
easy to implement but inflexible. Try writing a simple template generator 
that produces persuasive letters addressed to people trying to convince them 
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to stop smoking. The letter should include the standard elements of a letter 
as well as a discussion of the dangers of smoking and the advantages of 
quitting. For ideas, you can visit the stop web site, available through the 
SIGGEN web site. 

How flexible can you make the mechanism within the confines of tem¬ 
plate generation? Can you extend the system to take a case file on a partic¬ 
ular patient that contains their medical history and produces a customized 
letter? 

20.14 (Adapted from peba (Milosavljevic, 1997)). In the manner dis¬ 
cussed in exercise 20.13, write a template generator that produces encyclopedia¬ 
like descriptions of animals. For ideas, you can visit the peba II web site, 
available through the SIGGEN web site. 





MACHINE 

TRANSLATION* 


... Translation is a fine and exacting art, but there is much about 
it that is mechanical and routine Kay (1997) 


This chapter introduces techniques for machine translation (MT), the translation 
use of computers to automate some or all of the process of translating from mt 
one language to another. Translation, in its full generality, is a difficult, 
fascinating, and intensely human endeavor, as rich as any other area of hu¬ 
man creativity. Consider the following passage from the end of Chapter 
45 of the 18th-century novel The Story of the Stone , also called Dream of 
the Red Chamber , by Cao Xue Qin (Cao, 1973), with the Chinese original 
transcribed in the Mandarin dialect, and the English translation by David 
Hawkes: 

As she lay there alone, Dai-yu’s thoughts turned to Bao-chai... Then she lis¬ 
tened to the insistent rustle of the rain on the bamboos and plantains outside 
her window. The coldness penetrated the curtains of her bed. Almost without 
noticing it she had begun to cry. 

dai yu zi zai chuang shang gan nian bao chat .. 

Dai-yu alone on bed top think-of-with-gratitude Bao-chai 

you ting fan chuang wai zhu shao xiang ye zhe 

again listen to window outside bamboo tip plantain leaf of 

shang, yu sheng xi li, qing han tou mu, 

on-top, rain sound sigh drip, clear cold penetrate curtain, 

bu jue you di xia lei lai 

not feeling again fall down tears come. 

Consider some of the issues involved in this kind of literary transla¬ 
tion. First, there is the problem of how to translate the Chinese names, 


*This chapter mostly by Nigel Ward 
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complicated by Cao’s frequent use of names involving wordplay. Hawkes 
chose to use transliterations for the names of the main characters but to 
translate names of servants by their meanings (Aroma, Skybright). Chinese 
rarely marks verbal aspect or tense; Hawkes thus had to decide to translate 
Chinese tou as penetrated , rather than say was penetrating or had pene¬ 
trated. Hawkes also chose the possessive pronoun her to make her window 
more appropriate for the mood of a quiet bedroom scene than the window , 
To make the image clear for English readers unfamiliar with Chinese bed- 
curtains, Hawkes translated ma (‘curtain’) as curtains of her bed. Finally, 
the phrase bamboo tip plantain leaf although elegant in Chinese, where such 
four-character phrases are a hallmark of literate prose, would be awkward if 
translated word-for-word into English, and so Hawkes used simply bamboos 
and plantains. 

Translation of this sort clearly requires a deep and rich understanding 
of the source language and the input text, and a sophisticated, poetic, and 
creative command of the target language. The problem of automatically 
producing a high-quality translation of an arbitrary text from one language 
to another is thus far too hard to automate completely. But certain simpler 
translation tasks can be addressed with current computational models. In 
particular, machine translation system often focus on (1) tasks for which 
a rough translation is adequate, (2) tasks where a human post-editor can 
be used to improve MT output, and (3) tasks limited to small sublanguage 
domains in which fully automatic high quality translation is achievable. 

Information acquisition on the Web is the kind of ‘information pull’ 
task where readers may be willing to settle for a very rough translation. Con¬ 
sider these extracts from a French web page and a machine translation: 

Nous sommes une association type Loi de 1901, et notre raison d’etre 
est de practiquer, de promouvoir, de faire decouvrir le Paintball, et le 
cas echeant de supporter nos equipes de competition: ... Si vous avez 
des questions, des envies d’organisation de parties, des envies de jouer 
tout courte et des envies de decouvrir, n’hesitez pas a nous contacter 
par courrier ou par telephone ou bien encore par eMail. ... Au sortir de 
la saison 97/98 et surtout au debut de cette saison 98/99, les effectifs 
des HORS-TAXE sont modifies. 

We are a standard association Loi of 1901, and our raison d’ecirc;tre is to 
practice, promote, make discover Paintball, and to support our teams of com¬ 
petition if necessary: ...If you have questions, desires of organization of 
parts, desires for playing very short and desires for discovering, do not hes- 



itate to contact us by mail or telephone or even by eMail. .. .With leaving 
season 97/98 and especially at the beginning of this season 98/99, manpower 
of the HORS-TAXE are modified! 
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This is good enough to figure out that we have the found the home 
page of a paintball team, and one that seems friendly and perhaps willing to 
accept new members. Armed with this information, we can then try to find 
someone to properly translate it for us, or perhaps just go ahead and send 
e-mail to the organizer to ask if we can play. Incidentally, the use of MT for 
such document-finding purposes can sometimes be avoided or made more 
efficient by using cross-language information retrieval techniques, which 
focus on the retrieval of documents in a language other than that used for the 
query terms (Oard, 1997). 

Rough translation is also useful as the first stage in a complete trans¬ 
lation process. An MT system can produce a draft translation that can be 
fixed up in a post-editing process by a human translator. Even a rough 
draft can sometimes speed up the overall translation process. Strictly speak¬ 
ing, systems used in this way are doing computer-aided human transla¬ 
tion (CAHT or CAT) rather than (fully automatic) machine translation. This 
model of MT usage is effective especially for high volume jobs and those 
requiring quick turn-around. The most familiar example is perhaps the trans¬ 
lation of software manuals for localization to reach new markets. Another 
effective application is the translation of market-moving financial news, for 
example from Japanese to English for use by stock traders. 

Weather forecasting is an example of a sublanguage domain that can 
be modeled completely enough to use raw MT output even without post¬ 
editing. Weather forecasts consist of phrases like Cloudy with a chance of 
showers today and Thursday ., Low tonight 4 , high Thursday 10. and Out¬ 
look for Friday: sunny. This domain has a limited vocabulary and only a 
few basic phrase types. Ambiguity is rare, and the senses of ambiguous 
words are distinct and easily disambiguated based on local context, using 
word classes and semantic features such as month, place, direction, 
TIME POINT, TIME DURATION, DEGREE-OF-POSSIBILITY. Other domains 
that are sublanguage-like include equipment maintenance manuals, air travel 
queries, appointment scheduling, and restaurant recommendations. 


LANGUAGE 

INFORMATION 

RETRIEVAL 


POST-EDITING 


COMPUTER- 
AIDED HUMAN 
TRANSLATION 


LOCALIZA¬ 

TION 


SUBLAN¬ 

GUAGE 


This chapter breaks with the pattern of previous chapters in that the 
focus is less on introducing new techniques than on showing how the tech¬ 
niques presented earlier are used in practice. One of the themes of this chap¬ 
ter is that there are often trade-offs and difficult choices among alternative 
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approaches and techniques. 

Section 21.1 gives some simple illustrations of the ways in which lan¬ 
guages differ. The following four sections are organized four basic mod¬ 
els for doing MT: Section 21.2 introduces the use of syntactic transforma¬ 
tions for overcoming differences in grammar, as well as some techniques for 
choosing target language words. Section 21.3 introduces some ways of ex¬ 
ploiting meaning during translation, in particular the use of thematic roles 
and primitive decomposition. Section 21.4 presents the minimalist ‘direct’ 
approach. Section 21.5 discusses the use of statistical techniques to improve 
various aspects of MT. Finally, Section 21.6 discusses reasons for the gap 
between expectations and performance, and discusses strategies for meeting 
users’ needs despite finite development resources. 


21.1 Language Similarities and Differences 


UNIVERSAL 


TYPOLOGY 


ISOLATING 

POLYSYN¬ 

THETIC 


AGGLUTINA¬ 

TIVE 


When you accidentally pick up a radio program in some foreign language it 
seems like chaos, completely unlike the familiar languages of your everyday 
life. But there are patterns in this chaos, and indeed, some aspects of human 
language seem to be universal, holding true for every language. Many uni¬ 
versal arise from the functional role of language as a communicative system 
by humans. Every language, for example, seems to have words for referring 
to people, for talking about women, men, and children, eating and drink¬ 
ing, for being polite or not. Other universals are more subtle; for example 
Chapter 8 mentioned that every language seems to have nouns and verbs. 

Even when languages differ, these differences often have systematic 
structure. The study of systematic cross-linguistic similarities and differ¬ 
ences is called typology (Croft (1990), Comrie (1989)). This section sketches 
some typological facts about crosslinguistic similarity and difference. This 
bears on our main topic, MT, in that the difficulty of translating from one 
language to another depends a great deal on how similar the languages are 
in their vocabulary, grammar, and conceptual structure. 

Morphologically, languages are often characterized along two dimen¬ 
sions of variation. The first is the number of morphemes per word, rang¬ 
ing from isolating languages like Vietnamese and Cantonese, in which each 
word generally has one morpheme, to polysynthetic languages like Siberian 
Yupik (Eskimo), in which a single word may have very many morphemes, 
corresponding to a whole sentence in English. The second dimension is the 
degree to which morphemes are segmentable, ranging from agglutinative 
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languages like Turkish (discussed in Chapter 3), in which morphemes have 
relatively clean boundaries, to fusion languages like Russian, in which a fusion 
single affix may conflate multiple morphemes, like -om in the word stolom , 
(table-SG-iNSTR-DECL 1) which fuses the distinct morphological categories 
instrumental, singular, and first declension. 

Syntactically, languages are perhaps most saliently different in the ba¬ 
sic word order of verbs, subjects, and objects in simple declarative clauses. 
German, French, English, and Mandarin, for example, are all SVO lan- svo 
guages, meaning that the verb tends to come between the subject and object. 

Hindi and Japanese, by contrast, are SOV languages, meaning that the verb sov 
tends to come at the end of basic clauses, while Irish, Classical Arabic, and 
Biblical Hebrew are VSO languages. Two languages that share their basic vso 
word-order type often have other similarities. For example SVO languages 
generally have prepositions while SOV languages generally have postposi¬ 
tions; English has to Yuriko where Japanese has Yuriko ni. 

Another important syntactico-morphological distinction is between 
head-marking and dependent-marking languages (Nichols, 1986). Head¬ 
marking languages tend to mark the relation between the head and its depen¬ 
dents on the head. Dependent-marking languages tend to mark the relation 
on the non-head. Nichols (1986) for example, notes that Hungarian marks 
the possessive relation with an affix (A) on the head noun (H), where English 
marks it on the (non-head) possessor: 

(21.1) English the many’s H house 
Hungarian az ember ff haz- A a 

the man house-his 

This syntactic distinction is related to a semantic distinction in how 
languages map conceptual notions onto words. Talmy (1985) and (1991) 
noted that languages can be characterized by whether direction of motion 
and manner of motion are marked on the verb or on the ‘satellites’: particles, 
prepositional phrases, or adverbial phrases. For example a bottle floating out 
of a cave would be described in English with the direction marked on the 
particle out as: 

(21.2) The bottle floated out. 

but in Spanish with the direction marked on the verb as 

(21.3) La botella salid flotando. 

The bottle exited floating. 
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Languages that mark the direction of motion on the verb (leaving the 
satellites to mark the manner of motion) Talmy called verb-framed; Slobin 
(1996) gives examples like Spanish acercarse ‘approach’, alcanzar ‘reach’, 
entrar ‘enter’, salir ‘exit’. Languages that mark the direction of motion on 
the satellite (leaving the verb to mark the manner of motion) Talmy called 
satellite-framed; Slobin (1996) gives examples like English crawl out, float 
off, jump down, walk over to, run after . Talmy (1991) noted that verb- 
framed languages include Romance, Semitic, Japanese, Tamil, Polynesian, 
most Bantu, most Mayan, Nez Perce, and Caddo, while satellite-framed lan¬ 
guages include most Indo-European minus Romance, Finno-Ugric, Chinese, 
Ojibwa, and Warlpiri. 

In addition to such properties that systematically vary across large 
classes of languages, there are many specific characteristics, more or less 
unique to single languages. English, for example, has an idiosyncratic syn¬ 
tactic construction involving the word there that is often used to introduce a 
new scene in a story, as in there burst into the room three men with guns. 

To give an idea of how trivial, yet crucial, these differences can be, 
think of dates. Dates not only appear in various formats — typically YYM- 
MDD in Japanese, MM-DD-YY in American English, and DD/MM/YY in 
British English — the calendars themselves may differ, for example dates in 
Japanese often are relative to the start of the current Emperor’s reign rather 
than to the start of the Christian Era. 

Turning now to the question of lexical organization, here too there 
are interesting patterns. Many words can be translated relatively directly 
into other languages. English dog, for example, translates to Mandarin gou. 
Where English has chocolate, Italian has cioccolato and Japanese has choko- 
reeto} 

Sometimes, rather than a single word, there is a fixed phrase in the 
target language; French informatique thus translates to English computer 
science. In more difficult cases, however, a word in one language does not 
map so simply to a word or phrase in another language. 

Grammatically, for example, a word may translate best to a word of an¬ 
other part of speech in the target language. Many English sentences involv¬ 
ing the verb like must be translated into German using the adverbial germ, 
thus she likes to sing maps to sic singt gerne, where the syntactic structure is 
also affected. 


1 although chokoreeto in Japanese is perforce more formal than English chocolate , since 
Japanese also has the informal short form choko. 
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Sometimes one language places more grammatical constraints on word 
choice than another. English, for example, distinguishes gender in pronouns 
where Mandarin does not; thus translating a third-person singular pronoun 
from Mandarin to English requires deciding whether the original referent 
was masculine or feminine. The same is true when translating from the 
English pronoun plural they , unspecified for gender, into French (masculine 
Us, feminine elles). In Japanese, there is no single word for is, speakers must 
choose between iru or aru, based on whether the subject is animate 2 or not. 

Such differences in specificity also occur on the semantic side: one lan¬ 
guage may divide up a particular conceptual domain in more detail than an¬ 
other. English, for example, has a particularly impoverished kinship vocab¬ 
ulary; the single word brother can indicate either a younger or older brother. 
Japanese and Chinese, by contrast, both distinguish seniority in sibling rela¬ 
tions. Figure 21.1 gives some further examples. 


English 

brother 

Japanese 

Japanese 

Mandarin 

Mandarin 

otooto (younger) 
oniisan (older) 
gege (older) 
didi (older) 

English 

wall 

German 

German 

Wand (inside) 

Mauer (outside) 

English 

know 

French 

French 

connaitre (be acquainted with) 
savoir (know a proposition) 

English 

they 

French 

French 

Us (masculine) 
elles (feminine) 

German 

berg 

English 

English 

hill 

mountain 

Mandarin 

ta 

English 

he, she, or it 

Figure 21.1 

Differences in specificity. 



The way that languages differ in lexically dividing up conceptual space 
may be more complex than this one-to-many translation problem, leading to 
many-to-many mappings. For example Figure 21.2 summarizes some of the 
complexities discussed by Hutchins and Somers (1992) in relating English 
leg, foot, and paw, to the French jambe, pied, patte, etc. 

Further, one language may have a lexical gap, where no word or phrase, lexical gap 
short of an explanatory footnote, can express the meaning of a word in the 


2 Taxis and buses in service sometimes count as animate for this purpose. 
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other language. For example, Japanese does not have a word for privacy , 
and English does not have a word for Japanese oyakoko (we make do with 
filial piety). 

Moreover, dependencies on cultural context, as manifest in the back¬ 
ground and expectations of the readers of the original and translation, further 
complicate matters. A number of translation theorists (Steiner, 1975; Barn- 
stone, 1993; Hofstadter, 1997) refer to a clever story by Jorge Luis Borges 
showing that even two linguistic texts with the same words and grammar may 
have different meanings because of their different cultural contexts. Borges 
invents Menard, a French author in the 1930’s whose aim was to recreate 
Cervantes’ Don Quixote word for word: 

The text of Cervantes and that of Menard are verbally identical, but the sec¬ 
ond is almost infinitely richer. (More ambiguous, his detractors will say; but 
ambiguity is a richness.) It is a revelation to compare the Don Quijote of 
Menard with that of Cervantes. The latter, for instance, wrote: 

...la verdad, cuya madre es la historia, emula del tiempo, depdsito de 
las acciones, testigo de lo pasado, ejemplo y aviso de lo presente, ad- 
vertencia de lo por venir. 

Menard, on the other hand, writes: 

...la verdad, cuya madre es la historia, emula del tiempo, depdsito de 
las acciones, testigo de lo pasado, ejemplo y aviso de lo presente, ad- 
vertencia de lo por venir. 
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Equally vivid is the contrast in styles. The archaic style of Menard - in the 
last analysis, a foreigner — suffers from a certain affectation. Not so that of 
his precursor, who handles easily the ordinary Spanish of his time. 

These last points suggest a more general question about cultural dif¬ 
ferences and the possibility (or impossibility) of translation. A theoretical 

SAPIR- 

position sometimes known as the Sapir-Whorf hypothesis suggests that lan- whorf 
guage may constrain thought — that the language you speak may affect the 
way you think. To the extent that this hypothesis is true, there can be no per¬ 
fect translation, since speakers of the source and target languages necessarily 
have different conceptual systems. In any case it is clear that the differences 
between languages run deep, and that the process of translation is not going 
to be simple. 


21.2 The Transfer Metaphor 


As the previous section illustrated, languages differ. One strategy for doing 
MT is to translate by a process of overcoming these differences, altering the 
structure of the input to make it conform to the rules of the target language. 

This can be done by applying contrastive knowledge, that is, knowledge 
about the differences between the two languages. Systems that use this strat¬ 
egy are sometimes said to be based on the transfer model. mo a de s l fer 

Since this requires some representation of the structure of the input, 
transfer presupposes a parse of some form. Moreover, since transfer only 
results in a structure for the target language, it must be followed by a gener¬ 
ation phase to actually create the output sentence. Thus, on this model, MT 
involves three phases: analysis, transfer, and generation, where transfer 
bridges the gap between the output of the source language parser and the 
input to the target language generator. Figure 21.3 shows a sketch of this 
transfer architecture. 

It is worth noting that a parse for MT may differ from parses required 
for other purposes. For example, suppose we need to translate John saw the 
girl with the binoculars into French. The parser does not need to bother to 
figure out where the prepositional phrase attaches, because both possibilities 
lead to the same French sentence. However this is not true for all prepo¬ 
sitional phrase attachments, and so a MT system needs also to be able to 
represent disambiguated parses, while still being able to work with ambigu¬ 
ous ones (Emele and Doma, 1998). 
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Figure 21.3 The transfer architecture for Machine Translation. 


Syntactic Transformations 

Let us begin by considering syntactic differences. The previous section noted 
that in English the unmarked order in a noun-phrase had adjectives precede 
nouns, but in French adjectives follow nouns. 3 Temporarily postponing the 
question of how to translate the words, let’s consider how an MT system can 
overcome such differences. 


noun phrase 


noun phrase 


adjective noun 


noun adjective 


Figure 21.4 A simple transformation that reorders adjectives and nouns 


Figure 21.4 suggests the basic idea. Here we transform one parse tree, 
suitable for describing an English phrase, into another parse tree, suitable 

SYNTACTIC 

TRANSFOR- for describing a French sentence. In general, syntactic transformations are 

MAHONS 

operations that map from one tree structure to another. 

Now let’s illustrate how roughly how such transformations can restruc¬ 
ture an entire sentence, using a simplified sentence: 

(21.4) There was an old man gardening. 

We will assume that the parser has given us a structure like the follow¬ 
ing. We will also assume that the system starts performing transformations 

3 There are exceptions to this generalization, such as galore in English and gros in French; 
furthermore in French some adjectives can appear before the noun with a different meaning; 
route mauvaise ‘bad road, badly-paved road’ versus mauvaise route ‘wrong road’ (Waugh, 
1976 ). 
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at the top node of the tree and works its way down: 


Existential-There-Sentence 


there was an old man gardening 


Since this sentence involves an “existential there construction”, which 
has no analog in Japanese, we immediately have to apply a transformation 
that deletes the sentence-initial there and converts the fourth constituent to 
a relative clause modifying the noun, producing something like following 
structure: 


Intermediate-Representation 



an old man gardening was 


The resulting structure is thus something more like the structure of a 
pseudo-English sentence: an old man, who was gardening, was. 

Next, another transformation applies to reverse the order of the noun 
phrase and the relative clause, giving something like the following structure: 


Intermediate-Representation-2 



gardening an old man was 


At this point all relevant transformations have applied, and lexical 
transfer takes place, substituting Japanese words for the English ones, as 
discussed in the next section. This gives the final structure below: 


Japanese-S 



niwa no teire o suru ojiisan ita 

After this, a little more syntactic work is required to produce an actual 
Japanese sentence, including: 1. adding the word ga , which is required in 
Japanese to mark the subject, 2. choosing the verb that agrees with the sub¬ 
ject in terms of animacy, namely iru , not aru , and 3. inflecting the verbs. The 
final generation step traverses or otherwise linearizes the tree to produce a 
string of words. Although these generation tasks can be done by the tech¬ 
niques of Chapter 20, practical systems usually do them directly with simple 
procedures. In any case, the final output will be: 
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niwa no teire o shite ita ojiisan ga ita. 

garden gen upkeep OBJ do past-prog old man subj was 

Table 21.5 shows a rough representation of the transformations we 
have discussed. Such transformations can be implemented as pattern-rewrite 
rules: if the input matches the left side of a transformation, it is rewritten 
according to the right side. 


English to French: 

i. 

NP —> Adjective] Noun 2 

NP —> Noun 2 Adjective] 

Japanese to English: 

2 . 

Existential-There-Sentence —> Therei Vert >2 NP 3 PostnominaU 

Sentence —> (NP —> NP 3 Relative-Clause^ Verb 2 

3. 

NP —»NP] Relative Clause 2 

NP -¥ Relative-Clause 2 NP] 

Figure 21.5 An informal description of some transformations. 


Transformations in MT systems also may have more complex condi¬ 
tions for when they apply, and may include a “trigger”, that is, a specific 
word that is used to index the pattern, for efficiency. One way to formalize 
transformations is with unification-based models; indeed as Chapter 11 dis¬ 
cussed, the need for a reversible operation for MT was the original motiva¬ 
tion for both feature-structure unification (Kay, 1984) and term-unification 
(Colmerauer and Roussel, 1996). However, unification is computationally 
expensive and is not commonly used. 

Lexical Transfer 

Some of the output words are determined in the course of syntactic transfer 
or generation. In the example above, the function words ga and ita are mostly 
grammatically controlled. Content words are another matter. The process of 
finding target language equivalents for the content words of the input, lexical 
transfer, is difficult for the reasons introduced in Section 21.1. 

The foundation of lexical transfer is dictionary lookup in a crosslan¬ 
guage dictionary. As was discussed earlier, the translation equivalent may 
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be a single word or it may be a phrase, as in this example where gardening 
becomes niwa no teire o sum (‘do garden upkeep’). Furthermore, sometimes 
a generation process must subsequently inflect words in such phrases, as in 
this case. 

Section 21.1 also discussed the problem of words that have several pos¬ 
sible translations. In the example man is such a word. The correct choice 
here was ojiisan (‘old man’), but if the input had been man is the only linguis¬ 
tic animal , the translation of man would have been ningen (‘human being, 
man, men’); in most other cases hito (‘person, persons, man, men’) or re¬ 
lated words would have been appropriate. Fortunately there are at least two 
ways to tackle this problem: in the parsing or in the generation stage. The 
first method is to treat words like man as if they were ambiguous. That is, we 
assume that man can correspond to two more more concepts (perhaps hu¬ 
man and adult male) and that choosing the correct Japanese word is like 
disambiguating between these concepts. This way of treating lexical transfer 
lets us apply all the standard techniques for lexical disambiguation (Chap¬ 
ter 16). A second way is to treat such words as having only one meaning, 
and to handle the selection among multiple possible translations (ningen, 
hito , ojiisan and so on) by using constraints imposed by the target language 
during generation (Whitelock, 1992). In practice, these cases are more of¬ 
ten dealt with in the parsing stage, as the algorithms for lexical choice dur¬ 
ing generation are high-overhead (Ward, 1994), especially for content words 
(but see Section 21.5). 

In this specific example, however, the choice of how to translate man 
is easy. Because the previous word is old , the correct translation is ojiisan 
(‘old man’). Such inputs, where multiple source language words must be 
expressed with a single target language word, can be difficult to handle, re¬ 
quiring inference in the general case. But many such cases, including this 
one, can be treated simply as idioms, with their own entries in the bilingual 
dictionary. 


21.3 The Interlingua Idea: Using Meaning 


One problem with the transfer model is that it requires a distinct set of trans¬ 
fer rules for each pair of languages. This is clearly suboptimal for translation 
systems employed in multilingual environments like the European Union, 
where eleven official languages need to be intertranslated. 

This suggests a different perspective on the nature of translation. The 
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INTERLINGUA 


ONTOLOGY 


transfer model treats translation as a process of altering the structure and 
words of an input sentence to arrive at a valid sentence of the target language. 
An alternative to is to treat translation as a process of extracting the meaning 
of the input and then expressing that meaning in the target language. If this 
can be done, a MT system can do without contrastive knowledge, merely re¬ 
lying on the same syntactic and semantic rules used by a standard interpreter 
and generator for the language. The amount of knowledge needed is then 
proportional to the number of languages the system handles, rather than to 
the square, or so the argument goes. 

This scheme presupposes the existence of a meaning representation, 
or interlingua, in a language-independent canonical form, like the semantic 
representations we saw in Chapter 14. The idea is for the interlingua to rep¬ 
resent all sentences that mean the ‘same’ thing in the same way, regardless 
of the language they happen to be in. Translation in this model proceeds 
by performing a semantic analysis on the input from language X into the 
interlingual representation and generating from the interlingua to language 
Y. 

A frequently used element in interlingual representations is the notion 
of a small fixed set of thematic roles, as discussed in Chapter 16. When used 
in an interlingua, these thematic roles are taken to be language universals. 
Figure 21.6 shows a possible interlingual representation for there was an old 
man gardening as a unification-style feature structure 4 . We saw in Chap¬ 
ter 15 how a semantic analyzer can produce such a structure with a agent 
relation between man and gardening. Note that since the interlingua requires 
such semantic interpretation in addition to syntactic parsing, it requires more 
analysis work than the transfer model, which only required syntactic pars¬ 
ing. But generation can now proceed directly from the interlingua with no 
need for syntactic transformations. 

Note that the representation in Figure 21.6 includes the value gar¬ 
dening as the value for the event feature, and, although such cases are 
familiar from Chapter 14, one might object that this looks more like an En¬ 
glish word than it does an an element in a truly interlingual representation. 
There is a deeper question here, that of the appropriate inventory of concepts 
and relations for an interlingua; that is what ontology to use. Certainly a 
meaning representation designer has a lot of freedom when selecting a set 


4 Of course this is seriously inadequate as an account of the meaning of the existential-there 
construction. In fact, the currently least incomplete account of the syntax and semantics of 
there constructions in English takes 124 pages (Lakoff, 1987). 
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EVENT 

GARDENING 




MAN 


AGENT 


NUMBER 

SG 



DEFINITENESS 

INDEF 

ASPECT 

PROGRESSIVE 


TENSE 

PAST 



Figure 21.6 Interlingual representation of there was an old man gardening. 


of tokens and ascribing meanings to them. However, choice of an ontology 
for MT is not to be undertaken lightly, since it constrains the architecture 
of the system as a whole. For example, recall from Chapter 16 the discus¬ 
sion of two possible inventories of thematic roles, one containing agent 
and force, and one including agent only. The choice of which to adopt 
affects, for example, the way that the system will translate the quake broke 
glass (Chapter 16) into Japanese, where quake needs to be marked with de, 
not the usual subject marker ga , because the earthquake is not animate. If 
we design our interlingua using the smaller inventory that only uses agent, 
then the representation for this sentence will place the quake in the agent 
role, and the problem of de versus ga will fall to the generator. If, however, 
we use the expanded inventory of Figure 16.9, then the representation will 
include the force role, with the work needed to make that decision being 
performed by the semantic analyzer. 

The interlingua idea has implications not only for syntactic transfer 
but also for lexical transfer. The idea is to avoid explicit descriptions of 
the relations between source language words and target language words, in 
favor of mapping via concepts, that is, language-independent elements of 
the ontology. Recalling our earlier problem of whether to translate man as 
otoko , ningen , ojiisan , etc. it is clear that most of the processing involved is 
not specific to the goal of translating into Japanese; there is a more general 
problem of disambiguating man into concepts such as generic-human 
and male-human. If we commit to using such concepts in an interlingua, 
then a larger part of the translation process can be done with general lan¬ 
guage processing techniques and modules, and the processing specific to the 
English-to-Japanese translation task can be eliminated or at least reduced. 

Some interlinguas, and some other representations, go further and use 
lexical decomposition, that is, the disassembly of words into their component 
meanings. We saw a form of this in Figure 21.6, where was maps to PAST and 
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progressive, and a maps to singular and indefinite. Decomposition 
of content words is also possible: the word drink can be represented by 
(ingest, fluid, by-mouth) 5 . Representing a sentence by breaking down 
the words in such ways does seem to be actually capturing something about 
meaning, rather than being just a rearrangement of tokens that look like the 
English words of the input. Moreover, such representations are potentially 
useful for inference-based disambiguation. For example, it is possible to use 
the meanings of the words to infer what the prepositional phrase is modifying 
in the policeman saw the man with a telescope , versus the policeman shot 
the man with a telescope. It is, however, difficult to get inference of this 
sort to work for more than a few examples except in very small domains. 
In general, such high-powered interlingua-based techniques are not used in 
practice. 



Figure 21.7 Diagram Suggesting the Relation Between the Transfer and 
Interlingua Models, generally credited to Vauqois. 


Brushing over numerous important details, we can now contrast the 
transfer model with the interlingua model. The key implication for process- 

5 This use of semantic decomposition makes it clear which elements of meaning drink 
shares with eat and which it does not share. But as Chapter 16 discusses, lexical semantics is 
not so easy in general. For example, how does one express in a formal language the meaning 
of heft and the way it differs from weight , or the meanings of sporadic and intermittent ? 
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ing is that, by making the parser/interpreter and/or the generator do a little 
more work, we can eliminate the need for contrastive knowledge, as sug¬ 
gested in Figure 21.7. 

Doing the extra work involved by the interlingua commitment, how¬ 
ever, is not always easy. It requires the system designer to perform exhaus¬ 
tive analysis of the semantics of the domain and formalize that in an ontol¬ 
ogy (Levin et al , 1998). Today this is more an art than a science, although 
it is relatively tractable in sublanguage domains. In some cases the seman¬ 
tics can mostly be captured by a database model, as in the air travel, hotel 
reservation, or restaurant recommendation domains. In cases like these, the 
database definition determines the possible entities and relations; and the 
MT system designer’s task is largely one of determining how these map to 
the words and structures of the two languages. 

Another problem with the interlingua idea is that, in its pure form, it 
requires the system to fully disambiguate at all times. For a true universal 
interlingua, this may require some unnecessary work. For example, in order 
to translate from Japanese to Chinese the interlingua must include concepts 
such as elder-brother and younger-brother. However, to use those 
same concepts in the course of translating from German-to-English would 
require a parser to perform more disambiguation effort than is unnecessary; 
and will further require the system to include techniques for preserving am¬ 
biguity, to ensure that the output is ambiguous or vague in exactly the same 
way as the input. Even discounting the Sapir-Whorf idea, the idea of a uni¬ 
versal meaning underlying all languages is clearly not without problems. 


21.4 Direct Translation 

These models are all very nice, but what happens if the analysis fails? Users 
do not like to receive an output of “nil” due to “no parse tree found”; in gen¬ 
eral, they would rather get something imperfect than nothing at all. This is a 
challenge especially for interlingua-based models, where the system should 
not fail to translate it broke the glass because it can not figure out whether it 
is a force or agent. 

Several approaches are available. One is to use the robust parsing tech¬ 
niques discussed in Chapter 15, which sometimes amounts to translating by 
fragments. Another is to give up on producing elaborate structural analyses 
at all, and just do simple operations that can be done reliably. More radically, 
we could adopt the principle that a MT system should do as little work as 


PRESERVING 

AMBIGUITY 
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possible. Systems built according to this philosophy are sometimes called 
direct direct MT systems. Typically such systems are built with only one language 
pair in mind, and the only processing done is that needed to get from one 
specific source language to one specific target language. 

A direct MT system is typically composed of several stages, each fo¬ 
cused on one type of problem. For example, we can rewrite a Japanese 
sentence as an English one in six stages, as seen in Figure 21.8. Figure 21.9 


Stage 

Action 

1 . 

morphological analysis 

2. 

lexical transfer of content words 

3. 

various work relating to prepositions 

4. 

SVO rearrangements 

5. 

miscellany 

6. 

morphological generation 

Figure 21.8 Six Stages for a Direct MT System for Japanese to English 


illustrates how this might work for a simple example. 

Stage 1 in Figure 21.9 segments the input string into words (recall that 
Japanese, like Chinese, does not use spaces as word boundary markers), and 
does morphological analysis of complex verb forms. These can be done 
using the finite-state techniques of Chapter 3 and segmentation algorithms 
like the probabilistic one described in Chapter 5. 

Stage 2 chooses translation equivalents for the content words. This is 
done using a bilingual dictionary, or procedures that choose the correct trans¬ 
lation based on the local context and on the target language words already 
chosen. Figure 21.10 illustrates such a procedure. 

In this example lexical transfer is trivial. In general, though, there may 
be interdependencies among target-language words, and so lexical trans¬ 


input: 

After stage 1: 
After stage 2: 
After stage 3: 
After stage 4: 
After stage 5: 
After stage 6: 


watashihatsukuenouenopenwojonniageta. 

watashi ha tsukue no ue no pen wo jon ni ageru PAST. 

I ha desk no ue no pen wo John ni give PAST. 

I ha pen on desk wo John to give PAST. 

I give PAST pen on desk John to. 

I give PAST the pen on the desk to John. 

I gave the pen on the desk to John. _ 


Figure 21.9 An Example of Processing in a Direct System 
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fer this may be done in sub-stages, for example, verbs before nouns be¬ 
fore adjectives. For example, consider the problem of translating nomu from 
Japanese to English, where this must become either drink or take (medicine). 
This decision must be made before translations for modifiers are chosen, to 
allow translations such as drinking heavily and taking a lot of medicine, but 
not a scramble of the two. In general the problem of the best order in which 
to make decisions is a tricky one, although there are some standard solutions, 
as seen in Chapter 20. 

Stage 3 chooses to translate no ue no (‘at top of’) to on, and reverses 
the two associated noun phrases (desk and pen), since English prepositional 
phrases follow, not precede, the word they modify. In accordance with the 
dictionary entry for gave, which specifies subcategorization facts, it chooses 
to translate ni as to. 

Stage 4 invokes a procedure to move the verb from the end of the sen¬ 
tence to the position after the subject, and removes case marking from sub¬ 
jects and direct objects. 

Stage 5 handles things like moving case markers before nouns and in¬ 
serting articles. 

Finally Stage 6 inflects the verbs. 


function Directly_translate j^ucH/MANY(Russian word) returns 

if preceding word is how 
return skoVko 
else if preceding word is as 
return stoVkozhe 
else if word is much 
if preceding word is very 
return nil (not translated) 
else if following word is a noun 
return mnogo 
else /* word is many */ 

if preceding word is a preposition and following word is a noun 
return mnogii 
else return mnogo 


Figure 21.10 A procedure for translating much and many into Russian, 
adapted from Hutchins’ (1986, pg. 133) discussion of Panov 1960. 


There are several ways in which this approach differs from the ap- 
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proaches seen earlier. One is that it is a new way of modularizing the MT 
task, orthogonal to the types of modularity seen in the transfer and interlin¬ 
gua models in Figure 21.7. In the direct model, all the processing involving 
analysis of one specific problem (prepositions for example) is handled in one 
stage, including analysis, transfer, and generation aspects. The advantage of 
this is that solving specific problems one at a time may be more tractable. 
On the other hand, it can be advantageous to organize processing into larger 
modules (analysis, transfer, synthesis) if there is synergy among all the var¬ 
ious individual analysis problems, or among all the individual generation 
problems, etc. 

A second characteristic of direct systems is that lexical transfer may 
be more procedural. Lexical transfer procedures may eclectically look at the 
syntactic classes and semantic properties of neighboring words and depen¬ 
dents and heads, as seen in the decision-tree-like procedure for translating 
much and many into Russian in Figure 21.10. 

A third characteristic of direct models is that they tend to be conser¬ 
vative, to only reorder words when required by obvious ungrammaticality in 
the result of direct word-for-word substitution. In particular, direct systems 
generally do lexical transfer before syntactic processing. 

Perhaps the key characteristic of direct models is that they do with¬ 
out complex structures and representations. In general, they treat the input 
as a string of words (or morphemes), and perform various operations di¬ 
rectly on it — replacing source language words with target language words, 
re-ordering words, etc. — to end up with a string of symbols in the target 
language. 

In practice, of course, working MT systems tend to be combinations of 
the direct, transfer, and interlingua methods. But of course syntactic process¬ 
ing is not an all-or-nothing thing. Even if the system does not do a full parse, 
it can adorn its input with various useful syntactic information, such as part 
of speech tags, segmentation into clauses or phrases, dependency links, and 
bracketings. Many systems that are often characterized as direct translation 
systems also adopt various techniques generally associated with the transfer 
and interlingua approaches (Hutchins and Somers, 1992). 


21.5 Using Statistical Techniques 

The three architectures for MT introduced in previous sections, the transfer, 
interlingua, and direct models, all provide answers to the questions of what 
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representations to use and what steps to perform to translate. But there is 
another way to approach the problem of translation: to focus on the result, 
not the process. Taking this perspective, let’s consider what it means for a 
sentence to be a translation of some other sentence. 

This is an issue to which philosophers of translation have given a lot of 
thought. The consensus seems to be, sadly, that it is impossible for a sentence 
in one language to be a translation of a sentence in other, strictly speaking. 
For example, one cannot really translate Hebrew adonai roi (The Lord is my 
shepherd’) into the language of a culture that has no sheep. On the one hand, 
we can write something that is clear in the target language, at some cost in 
fidelity to the original, something like the Lord will look after me. On the 
other hand, we can be faithful to the original, at the cost of producing some¬ 
thing obscure to the target language readers, perhaps like the Lord is for me 
like somebody who looks after animals with cotton-like hair. As another ex¬ 
ample, if we translate the Japanese phrase jukaku hansei shite orimasu , as 
we apologize , we are not being faithful to the meaning of the original, but if 
we produce we are deeply reflecting (on our past behavior, and what we did 
wrong , and how to avoid the problem next time), then our output is unclear 
or awkward. Problems such as these arise not only for culture-specific con¬ 
cepts, but whenever one language uses a metaphor, a construction, a word, 
or a tense without an exact parallel in the other language. 

So, true translation, which is both faithful to the source language and 
natural as an utterance in the target language, is sometimes impossible. If 
you are going to go ahead and produce a translation anyway, you have to 
compromise. This is exactly what translators do in practice: they produce 
translations that do tolerably well on both criteria. 

This provides us with a hint for how to do MT. We can model the 
goal of translation as the production of an output that maximizes some value 
function that represents the importance of both faithfulness and fluency. If 
we chose the product of fluency and faithfulness as our quality metric, we 
can formalize the translation problem as: 

best-translation T - argmax 7 fluency(T) faithfulness(T,S) 

where T is the target-language-sentence and S the source-language-sentence. 

This model of translation was first described by researchers coming 
from speech recognition (Brown et al. , 1990a, 1993), and this model clearly 
resembles the Bayesian models we’ve used for speech recognition in Chap¬ 
ter 7 and for spell checking in Section 5.4. We can make the analogy perfect 
and apply the noisy channel model of Section 5.4 if we think of things back- 
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wards: thus we pretend that the input we must translate is a corrupted version 
of some target language sentence, and that our task is to discover that target 
language sentence: 

best-translation f = argmax r P(T) /^(Sjr) 

To implement this, we need to do three things: quantify fluency, P(T), 
quantify faithfulness, P(5|T) and create an algorithm to find the sentence 
that maximizes the product of these two things. 

There is an innovation here. In the transfer, interlingua, and direct 
models, each step of the process made some adjustment to the input sentence 
to make it closer to a fluent TL sentence, while obeying the constraint of not 
changing the meaning too much. In those models the process is fixed, in that 
there is no flexibility to trade-off a modicum of faithfulness for a smidgeon of 
naturalness, or conversely, based on the specific input sentence at hand. This 
new model, sometimes called the statistical model of translation allows 
exactly that. 


Quantifying Fluency 

Fortunately, we already have some useful metrics for how likely a sentence 
is to be a real English sentence: the language models from Chapters 6 and 
8. These allow us to distinguish things that are readable but not really En¬ 
glish (such as that car was almost crash onto me) from things that are more 
fluent (that car almost hit me). This is especially valuable for word order 
and collocations, and as such can be a useful supplement to the generation 
techniques of Chapter 20. 

Fluency models can be arbitrarily sophisticated; any technique that can 
assign a better probability to a target language string is appropriate, including 
the more sophisticated probabilistic grammars of Chapter 12 or the statistical 
semantic techniques of Chapter 17. 

Of course, the idea of using monolingual language knowledge to im¬ 
prove MT output is independent of the decision to model that knowledge 
statistically. Indeed, many MT systems, especially direct ones, have a final 
phase, in which the system uses local considerations to revise word choices 
in the output. For example, capitalizing every occurrence of white house 
that occurs as the subject of a verb (the white house announced today) is a 
reasonable heuristic. 
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Quantifying Faithfulness 

Given the French sentence ca me plait (that me pleases ) and some conceiv¬ 
able English equivalents that pleases me , I like it , and III take that one , and 
yes , good , it is intuitively clear that the first is more faithful. 

Although it is hard to quantify this intuition, one basic factor often used 
in metrics for fidelity is the degree to which the words in one sentence are 
plausible translations of the words of the other. Thus we can approximate 
the probability of a sentence being a good translation as the product of the 
probabilities that each target language word is an appropriate translation of 
some source language word. For this we need to know, for every source 
language word, the probability of it mapping to each possible target language 
word. 

Where do we get these probabilities? Standard bilingual dictionaries 
do not include such information, but they can be computed from bilingual 
corpora, that is, parallel texts in two languages. This is not trivial, since 
bilingual corpora do not come with annotations specifying which word maps 
to which. Solving this problem requires first solving the problem of sen¬ 
tence alignment in a bilingual corpus, determining which source language aug t n e ment 
sentence maps to which target language sentence, which can be done with 
reasonable accuracy (Kay and Roscheisen, 1993; Gale and Church, 1993; 
Melamed, 1999; Manning and Schiitze, 1999). The second problem, word 
alignment, that is, determining which word(s) of the target correspond to alignment 
each source language word or phrase, is rather more difficult (Melamed, 
pear), and is often addressed with EM methods (cf. Chapter 7). From bilin¬ 
gual corpora aligned in these ways it is possible to count how many times a 
word, phrase, or structure gets mapped to each of its possible translations. 

Such alignments are potentially useful not only for MT but also for auto¬ 
matic generation of bilingual dictionary entries for use by human translators 
(Dagan and Church, 1997; Fung and McKeown, 1997). 

Let’s now consider an example. Suppose we want to translate the two- 
word Japanese phrase 2000nen taio into English. The most probable transla¬ 
tion for the first word is, we will assume, 2000 , followed by year 2000 , Y2K , 

2000 years , 2000 year and some other possibilities. The most probable trans¬ 
lation for the second word is, we will assume, correspondence , followed by 
corresponding , equivalent , tackle , deal with , dealing with , countermeasures , 
respond , response , counterpart , antithesis and so on. Thus, according to the 
translation model alone, the most highly ranked candidate will be the com¬ 
position of the most highly ranked words, namely 2000 countermeasures. 
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But, when the contribution of the fluency model, perhaps a bigram model, is 
factored in, the candidate translation dealing with Y2K will have the highest 
overall score. 

Of course, more complex translations models are possible: anything 
that generates multiple translations with a ranking associated with each. It 
is even possible to do “multi-engine” translation, where several translation 
models (for example a powerful but brittle interlingua-based one and a robust 
but low-quality direct one) are run in parallel to generate various translations 
and translation fragments, with the final output determined by assembling 
the pieces which have highest confidence scores (Brown and Frederking, 
1995). 

Search 

So far we have a theory of which sentence is best, but not of how to find it. 
Since the number of possible translations is enormous, we must find the best 
output without actually generating the infinite set of all possible translations. 
But this is just a decoding problem, of the kind we have seen how to solve via 
the pruned Viterbi (beam-search) and A* algorithms of Chapter 7. For MT 
this decoding is done in the usual way: outputs (translations) are generated 
incrementally, and evaluated at each point. If at any point the probability 
drops below some criterion that line of attack is pruned. Generation can be 
left to right or outward from heads. 

Good introductions to statistical MT include (Brown et al , 1990b) and 
(Knight, 1997). One of the most influential recent systems is described in 
(Knight et al, 1994). 


21.6 Usability and System Development 


Since MT systems are generally run by human operators, the human is avail¬ 
able to help the machine. One way to use human intervention is interac¬ 
tively; that is, when the system runs into a problem, it can ask the user. 
For example, a system given the input the chicken are ready to eat could 
generate paraphrases of both possible meanings, and present the user with 
those alternatives, for example, asking her to decide whether the sentence 
means the chicken are ready to be eaten or the chicken are ready to eat 
something. It turns out that this is incredibly annoying — users do not like 
to have to answer questions from a computer, or to feel that they exist to help 
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the computer get its work done (Cooper, 1995). On the other hand, people 
are comfortable with the job of fixing up poorly-written sentences, and so 
post-editing is the normal mode of human interaction with MT systems. 

People are also able to edit sentences of the source language, and this 
ability can be exploited as way to improve the translatability of the input by 
simplifying it in various ways. Such pre-editing can be more cost-effective pre-editing 
than post-editing if a single document needs to be translated into several lan¬ 
guages, since the cost of pre-editing can then be amortized over many output 
languages — as is often the case for companies which sell things complete 
with documentation, in many countries (Mitamura and Nyberg, 1995). In 
order to decide what needs pre-editing, one way is to apply MT and see 
what comes out wrong, and then go back and rewrite those sentences in the 
original. Another way is to have a model of what MT ought to handle, and 
require input sentences to be rewritten in that sublanguage, for example, by 
disallowing PPs which could attach ambiguously. If such a model exists, the 
pre-editing phase can actually be dispensed with, by training the technical 
writers to only write in simple, unambiguous controlled language, a version language EC 
of English that passes the constraints of the sublanguage grammar checker. 

Doing so may also make the source language text more understandable. This 
is interesting as a case where focusing on the larger task (getting information 
from tech writers to customers), rather than the problem as originally posed 
(to translate some existing documents), leads to improvements of the entire 
process. 

In general, user satisfaction is vital for MT systems. Various evaluation 
metrics are used to predict acceptability. Evaluation metrics for MT intended 
to be used raw (for information acquisition) include the percentage of sen¬ 
tences translated correctly, or nearly correctly, where correctness depends on 
both fidelity and fluency. The typical evaluation metric for MT output to be 
post-edited is edit cost, either relative to some standard translation via some 
automatic measure of edit-distance, similar to those seen in Chapter 7 for 
evaluating speech recognition, or measured directly as the amount of time 
(or number of keystrokes) required to correct the output to an acceptable 
level. 

In general the content words are crucial; users can generally recover 
from scrambled syntax, but having the words translated properly is vital. In 
practice, one of the major advantages of using a MT system is that it handles 
most of the tedious work of looking up words in bilingual dictionaries. 6 As a 


6 MT systems can also save time typing in the target language word, especially for transla- 
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result, professional MT users put great value on dictionary size and quality. 
Such users typically augment the basic system dictionary with the purchase 
of a domain-specific dictionary designed for the type of translation work 
they do: medical, electronic, financial, military intelligence etc. But no off- 
the-shelf dictionary, even one developed from a corpus of texts in the proper 
domain area, is more than an approximation to the dictionary needed by 
a specific customer, and so established translation bureaus typically invest 
substantial effort in augmenting the system dictionaries with entries of their 
own. The structure of these dictionaries is simple because the specialist 
terminology of any field is generally unambiguous — a photon is a photon is 
a photon, no matter what context it comes up in — and because terminology 
is almost invariably open-class words, with no syntactic idiosyncrasies. 

It has also become apparent that MT systems do better if the dictionar¬ 
ies include not only words but also idioms, fixed phrases, and even frequent 
clauses and sentences. Such data can sometimes be extracted automatically 
from corpora. Moreover, in some situations it may be valuable to do this 
on-line, at translation time, rather than saving the results in a dictionary — 
this is they key idea behind Example-based Machine Translation (Sumita 
andlida, 1991; Brown, 1996). 

User satisfaction also turns out to depend on factors other than the ac¬ 
tual quality of the translation. Many users care less about output quality than 
other factors, such as cost, speed, storage requirements, the ability to run 
transparently inside their favorite editor, the ability to preserve SGML tags, 
and so on. Translation memory, the ability to store and recall previously 
corrected translations, is also a big selling point. 

Although for expository purposes the previous sections have focussed 
on a few basic problems that arise in translation, it is important to realize 
that these far from exhaust the things that MT systems have to worry about. 
As Section 21.1 may have suggested, language differences are a virtually 
inexhaustible source of complexity; and if you were reading the footnotes in 
the previous sections, you may have been annoyed that every “fact” we men¬ 
tioned about a language was actually an oversimplification. Indeed, much of 
the work developing a MT system is down in the weeds, dealing with details 
like this, regardless of the overall system architecture chosen. Furthermore, 
adding more knowledge does not always help, since a working MT system, 
like any huge software system, is a large, delicate piece of code. Improve¬ 
ment to the treatment of one phenomenon, or a correction of a bug in the 


tions into Chinese and Japanese, where it is time-consuming to enter characters. 
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translation of one sentences, can cause other sentences, previously translated 
correctly, to go awry. 

Given all this, it is surprising that MT systems so as well as they do. 
One development technique of proven value is iterative development: build 
it, evaluate it in actual use, improve it, and repeat. In the course of this 
process the MT system is adapted to a domain, to the working habits of its 
users, and to the needs of the consumers of the output. 

21.7 Summary 

• Although MT systems exploit many standard language-processing tech¬ 
niques, there are also some MT-specific ones, including notably syn¬ 
tactic transformations. 

• We have presented four models for MT, the transfer, interlingua, di¬ 
rect, and statistical approaches. Practical MT systems today, how¬ 
ever, typically combine ideas from several of these models; while MT 
research systems are probing other niches in the design space. 

• MT system design is hard work, requiring careful selection of models 
and algorithms and combination into a useful system. Today this is 
more a craft than a science, especially since this must be done while 
minimizing development cost. 

• While MT system design today is thus fairly ad hoc, there are ongoing 
efforts to develop useful formal models of translation (Alshawi et al , 
1998; Knight and Al-Onaizan, 1998; Wu and Wong, 1998). 

• While the possibilities for improvement for MT is truly impressive, 
the output of today’s systems is acceptable for rough translations 
for information-acquisition purposes, draft translations intended to 
be post-edited by a human translator, and translation for sublanguage 
domains. 

• As for many software tasks, user interface issues in MT are crucial; the 
value of MT systems to users is not directly related to the sophistication 
of their algorithms or representations, nor even necessarily to output 
quality. 

• Despite half a century of research, MT is far from solved. Human 
language is a rich and fascinating area whose treasures have only begun 
to be explored. 
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Bibliographical and Historical Notes 


Work on models of the process and goals of translation goes back at least to 
Saint Jerome in the fourth century (Kelley, 1979). The development of log¬ 
ical languages, free of the imperfections of human languages, for reasoning 
correctly and for communicating truths and thereby also for translation, has 
been pursued at least since the 1600s (Hutchins, 1986). 

By the late 1940s, scant years after the birth of the electronic computer, 
the idea of MT was raised seriously (Weaver, 1955a). In 1954 the first public 
demonstration of a MT system prototype (Dostert, 1955) led to great excite¬ 
ment in the press (Hutchins, 1997). The next decade saw a great flowering of 
ideas, prefiguring most subsequent developments. But this work was ahead 
of its time — implementations were limited by, for example, the fact that 
pending the development of disks there was no good way to store dictionary 
information. 

As high quality MT proved elusive (Bar-Hillel, 1960), a growing con¬ 
sensus on the need for more basic research in the new fields of formal and 
computational linguistics led in the mid 1960s to a dramatic cut in fund¬ 
ing for MT research. As MT research lost academic respectability, the As¬ 
sociation for Machine Translation and Computational Linguistics dropped 
MT from its name. Some MT developers, however, persevered, slowly 
and steadily improving their systems, and slowly gamering more customers. 
Systran in particular, developed initially by Peter Toma, has been contin¬ 
uously improved over 40 years. Its earliest uses were for information ac¬ 
quisition, for example by the US Air Force for Russian documents; and in 
1976 an English-French edition was adopted by the European Community 
for creating rough and post-editable translations of various administrative 
documents. Our translation example in the introduction was produced using 
the free Babelfish version of Systran on the Web. Another early successful 
MT system was Meteo, which translated weather forecasts from English to 
French; incidentally, its original implementation (1976), used “Q-systems”, 
an early unification model. 

The late 1970s saw the birth of another wave of academic interest in 
MT. One source of excitement was the possibility of using Artificial Intel¬ 
ligence techniques ideas, originally developed for story understanding and 
knowledge engineering (Carbonell et al, 1981). This interest in meaning- 
based techniques was also a reaction to the dominance of syntax in computa- 
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tional linguistics at that time. Another motivation for the use of interlingual 
models was their introspective plausibility: the idea that MT systems should 
translate as people do (presuming that people translate by using their abil¬ 
ity to understand). Introspection here may be misleading, since the process 
of human translation is enormously complex and furthermore the relevance 
for machine translation is unclear. Concerns about such issues were much 
discussed in the late 1980s and early 1990s Tsujii (1986), Nirenburg et al 
(1992), Ward (1994), Carbonell et al (1992). Meanwhile MT usage was 
increasing, fueled by the increase in international trade and the growth of 
governments with policies requiring the translation of all documents into 
multiple official languages, and enabled by the proliferation of word proces¬ 
sors, and then personal computers, and then the World Wide Web. 

The 1990s saw the application of statistical methods, enabled by the 
development of large corpora. Excitement was provided by the “grand chal¬ 
lenge” of building speech-to-speech translation systems (Kay et al , 1992; 
Bub et al , 1997; Frederking et al , pear) where MT catches up with the mod¬ 
em vision of computers being embedded, ubiquitous and interactive. On the 
practical side, with the growth of the user population, user’s needs have had 
an increasing effect on priorities for MT research and development. 

Good surveys of the early history of MT are Hutchins (1986) and 
(1997). The textbook by Hutchins and Somers (1992) includes a wealth 
of examples of language phenomena that make translation difficult, and ex¬ 
tensive descriptions of some historically significant MT systems. 

Academic papers on machine translation appear in the journal Machine 
Translation and in the proceedings of the biennial (odd years) Conferences 
on Theoretical and Methodological Issue in Machine Translation. 

Reports on systems, markets, and user experiences can be found in MT 
News International , the newsletter of the International Association for Ma¬ 
chine Translation, which is the umbrella organization for the three regional 
MT societies: the Association for MT in the Americas, the Asian-pacific 
Association for MT, and the European Association for MT. These societies 
have annual meetings which bring together developers and users. The pro¬ 
ceedings of the biennial MT Summit (odd years) are also often published. 
The mainstream computational linguistics journals and conferences also oc¬ 
casionally report work in machine translation. 
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Exercises 

21.1 Select at random a paragraph of Chapter 9 which describes a fact 
about English syntax, a) Describe and illustrate how your favorite foreign 
language differs in this respect, b) Explain how a MT system could deal 
with this difference. 

21.2 Go to the literature section of the library, and find a foreign language 
novel in a language you know. Copy down the shortest sentence on the first 
page. Now look up the rendition of that sentence in an English translation of 
the novel, a) For both original and translation, draw parse trees, b) For both 
original and translation, draw dependency structures, c) Draw a case struc¬ 
ture representation of the meaning which the original and translation share, 
d) What does this exercise suggest to you regarding intermediate representa¬ 
tions for MT? 

21.3 Pick a word from the first sentence of the top article of today’s news¬ 
paper. a) List the possible equivalents found in a bilingual dictionary, b) 
Sketch out how a MT system could choose the appropriate translation to use 
based on the context of occurrence, c) Sketch out how this could be done 
without using contrastive knowledge. 

21.4 The idea of example-based MT can be extended to “translation by 
analogy” (Sato and Nagao, 1990). a) Given the bilingual data in Figure 21.11, 
what Japanese word do you think would be appropriate as a translation of on 
in research on gastropodsl b) Specify an algorithm for doing lexical transfer 
in this way. c) How is your approach similar to choice of TL words by using 
a TL language model (Section 21.5)? d) How is it similar to disambiguation 
using semantic features as in Chapter 16? 


the cat on the mat 

no ue no 

more notes on decision making 

ni tsuite 

pink frosting on the cake 

no 

see boats on the pond 

no, ni 

always reading on the bus 

de 

Figure 21.11 A mini-corpus of made-up phrases involving on and their 
Japanese translations 


21.5 Type a sentence into a MT system (perhaps a free demo on the Web) 
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and see what it outputs, a) List the problems with the translation, b) Rank 
these problems in order of severity, c) For the two most severe problems, 
suggest the probable root cause. 

21.6 Since natural languages are hard to deal with, due to ambiguities, ir¬ 
regularities, and other complexities, it is much nicer to work with something 
with is more logical: something that does not have these ‘flaws’ of natu¬ 
ral language. As a result, various notations which are (in some ways) less 
ambiguous or more regular than English have been proposed. In addition 
to various meaning representation schemes, natural languages such as Es¬ 
peranto and Sanskrit, have also been proposed for use as interlinguas for 
machine translation. Is this a good idea? Why or why not? 

21.7 Consider the types of ‘understanding’ needed: 1. for a natural lan¬ 
guage interface to a database, as seen in Chapter 15. 2. for an information 
extraction program, as seen in Chapter 15. 3. for a MT system. Which of 
these requires a deeper understanding? In what way? 

21.8 Choose one of the generation techniques introduced in Chapter 20 
and explain why it would or would not be useful for MT. 

21.9 Version 1 (for native English speakers): Consider the following sen¬ 
tence: 


These lies are like their father that begets them; gross as a mountain, 
open, palpable. 

Henry IV, Part 1, act 2, scene 2 

Translate this sentence into some dialect of modem vernacular English. 
For example, you might translate it into the style of a New York Times edi¬ 
torial or an Economist opinion piece, or into the style of your favorite tele¬ 
vision talk-show host. 

Version 2 (for native speakers of other languages): Translate the fol¬ 
lowing sentence into your native language. 

One night my friend Tom, who had just moved into a new apartment, 
saw a cockroach scurrying about in the kitchen. 

For either version, now: 

a) Describe how you did the translation: What steps did you perform? 
In what order did you do them? Which steps took the most time? b) Could 
you write a program that would translate using the same methods that you 
did? Why or why not? c) What aspects were hardest for you? Would they 
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be hard for a MT system? d) What aspects would be hardest for a MT sys¬ 
tem? are they hard for people too? e) Which models are best for describing 
various aspects of your process (direct, transfer, interlingua or statistical)? f) 
Now compare your translation with those produced by friends or classmates. 
What is different? Why were the translations different? 

21.10 Newspaper reports of MT systems invariably include an example of 
a sentence, typically a proverb, that when translated from English to lan¬ 
guage X, and then back to English, came out funny, a) Is this evidence that 
at least one of the two MT systems was bad? b) Why does this problem not 
arise with human translators? or does it? c) On the other hand, does a suc¬ 
cessful translation to a foreign language and back indicate that the system is 
doing well? 

21.11 Set yourself an information acquisition task: for example, to find a 
World-Wide Web page in your favorite foreign language reviewing a recent 
movie, and discover what the reviewer thought. Accomplish this task using 
one or two of the Web’s machine translation providers, a) Give two exam¬ 
ples each of correct and incorrect translations you encountered, b) Come up 
with a simple quality metric for rating the MT output, and use it to evalu¬ 
ate the MT systems you tried, c) Were you able to find a page of the kind 
you wanted? d) Were you able to figure out whether the reviewer liked the 
movie? e) Were the scores on your quality metric predictive of your answers 
to (c) and (d)? 

21.12 Consider each of the following as an application for machine trans¬ 
lation. Rank the difficulty of each from 1 (easy) to 4 (very very hard). Also, 
for each task, say briefly what makes it easy or hard. 

a. letters between an American girl and her Chinese pen-pal 

b. electronic junk mail 

c. articles in chemistry journals 

d. magazine advertisements 

e. children’s storybooks 

f. history books 

g. an English-speaker wanting to read articles in Japanese newsgroups 

h. an English-speaker wanting to post articles to a Japanese newsgroup 




REGULAR EXPRESSION 
OPERATORS 


Perl 

grep 

MS Word 

Description 

Single character expressions 

\... 

\... 

\... 

a special character 

. 

. 

9 

any single character 

[...] 

[...] 

[...] 

any single character listed 




any single character in the range 


r...] 

[!...] 

any single character not listed 




any single character not in the range 

Anchors/Expressions which match positions 




beginning of line 

$ 

$ 

$ 

end of line 

\b 

- 

- 

word boundary 

\B 

- 

- 

word non-boundary 

- 

\< 

< 

start of word 

- 

\> 

> 

end of word 

Counters/Expressions which quantify previous expressions 

* 

* 

- 

zero or more of previous r.e. 

+ 

- 

@ 

one or more of previous r.e. 

? 

- 

- 

exactly one or zero of previous r.e. 

{n} 

\{n\} 

{n} 

n of previous r.e. 

{n,m} 

\ {n, m\} 

{n, m} 

from n to m of previous r.e. 

{n, } 

\{n,\} 

{n, } 

at least n of previous r.e. 

Figure A.l Basic regular expressions 
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Appendix A. Regular Expression Operators 


Perl 

grep 

MS Word 

Description 

Other 

. * 

. * 

* 

any string of characters 

... |... 

- 

~ 

or - matches either r.e. 

(...) 

\(...\) 

(...) 

grouping, memory 

Shortcuts 

\d 

[0-9] 

[0-9] 

any digit 

\D 

["0-9] 

["0-9] 

any non-digit 

\w 

[a-zA-Z0-9,J 

[a-zA-Z0-9,J 

any alphanumeric/space 

\W 

[ "a-zA-Z0-9 1 _ 1 ] 

[ "a-zA-Z0-9,J 

any non-alphanumeric 

\s 

[^XrXtXnXf] 

- 

whitespace (space, tab) 

\S 

[ "^XrXtXnXf ] 

- 

non-whitespace 

Figure A.2 More regular expressions 




THE PORTER STEMMING 
ALGORITHM 


For the purposes of the Porter (1980) algorithm we define a consonant as a 
letter other than A, E, I, O, and U, and other than Y preceded by a consonant. 
Any other letter is a vowel. (This is of course just an orthographic approxi¬ 
mation.) Let c denote a consonant and v denote a vowel. C will stand for a 
string of one or more consonants, and Y for a string of one or more vowels. 
Any written English word or word part can be represented by the follow¬ 
ing regular expression (where the parentheses () are used to mark optional 
elements): 

(cwcnv) 

For example the word troubles maps to the following sequence: 

troubles 
C V C VC 


with no final Y. We call the Kleene operator m the measure of any word or 
word part; the measure correlates very roughly with the number of syllables 
in the word or word part. Some examples: 


m=() 

m=l 

m=2 


TR, EE, TREE, Y, BY 
TROUBLE, OATS, TREES, IVY 
TROUBLES, PRIVATE, OATEN, ORRERY 


The rules that we will present below will all be in the following format: 


(condition) SI S2 


meaning “if a word ends with the suffix SI, and the stem before SI satisfies 
the condition, SI is replaced by S2”. Conditions include the following and 
any boolean combinations of them: 
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Appendix B. The Porter Stemming Algorithm 


m 

the measure of the stem 

*S 

the stem ends with S (and similarly for other letters) 

^ y ^ 

the stem contains a vowel 

*d 

the stem ends with a double consonant (e.g. -TT, -SS) 

*0 

the stem ends CVC, where the second c is 


not W, X, or Y (e.g. -WIL, -HOP) 


The Porter algorithm consists of seven simple sets of rules, applied in 
order. Within each step, if more than one of the rules can apply, only the one 
with the longest matching suffix (SI) is followed. 

Step 1: Plural Nouns and Third Person Singular Verbs 

The rules in this set do not have conditions: 


SSES -> SS 

caresses — > caress 

IES -$• I 

ponies —> poni 

ties —> ti 

SS -»■ SS 

caress caress 

S -»• e 

cats cat 


Step 2a: Verbal Past Tense and Progressive Forms 


(m> 1) EED -»• EE 

feed feed 

agreed agree 

(*v*) ED —> £ 

plastered —» plaster 
bled —>■ bled 

(*v*) ING —> £ 

motoring —> motor 
sing —> sing 


Step 2b: Cleanup 

If the second or third of the rules in 2a is successful, we run the following 
rules (that remove double letters and put the E back on -ATE/-BLE) 
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AT ATE 

conflat(ed) —> conflate 

BL BLE 

troubl(ing) —> trouble 

IZ IZE 

siz(ed) —> size 

(*d & !(*L or *S or *Z)) —y single letter 

hopp(ing) hop 

tann(ed) ->■ tan 

fall(ing) -> fall 

hiss(ing) hiss 

fizz(ed) —> fizz 

(m=l&*o) -> E 

fail(ing) —> fail 

fil(ing) —> file 


Step 3: Y —> I 


(*v*) Y —*• I 


happy —» happi 
sky —> sky 


Step 4: Derivational Morphology I: Multiple suffixes 


(m > 0) 

ATTONAL 


ATE 

relational 


relate 

(m > 0) 

TIONAL 


TION 

conditional 

->* 

condition 





rational 


rational 

(m > 0) 

ENCI 


ENCE 

valenci 


valence 

(m > 0) 

ANCI 

->- 

ANCE 

hesitanci 

-► 

hesitance 

(m > 0) 

IZER 


IZE 

digitizer 


digitize 

(m > 0) 

ABU 


ABLE 

conformabli 


conformable 

(m > 0) 

ALU 


AL 

radicalli 


radical 

(m > 0) 

ENTLI 


ENT 

differentli 


different 

(m > 0) 

ELI 


E 

vileli 


vile 

(m > 0) 

OUSLI 


OUS 

analogousli 


analogous 

(m > 0) 

IZATION 


IZE 

vietnamization 


vietnamize 

(m > 0) 

ATTON 

-> 

ATE 

predication 

-> 

predicate 

(m > 0) 

ATOR 


ATE 

operator 


operate 

(m > 0) 

ALISM 


AL 

feudalism 


feudal 

(m > 0) 

IVENESS 

->- 

IVE 

decisiveness 

-► 

decisive 

(m > 0) 

FULNESS 


FUL 

hopefulness 


hopeful 

(m > 0) 

OUSNESS 


OUS 

callousness 


callous 

(m > 0) 

ALIE 


AL 

formaliti 


formal 

(m > 0) 

IVITI 

->- 

IVE 

sensitiviti 

-► 

sensitive 

(m > 0) 

BILITI 


BLE 

sensibiliti 


sensible 
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Step 5: Derivational Morphology II: More multiple suffixes 


(m > 0) ICATE -4 IC 
(m > 0) ATIVE -4 £ 
(m > 0) ALIZE -4 AL 
(m > 0) ICITI -4 IC 
(m > 0) FUL -4 £ 
(m > 0) NESS -4 £ 


triplicate -4 triplic 
formative —> form 
formalize -4 formal 
electriciti -4 electric 
hopeful —> hope 
goodness -4 good 


Step 6: Derivational Morphology III: single suffixes 


(m > 1) AL -4 £ 

(m > 1) ANCE -4 £ 

(m > 1) ENCE -4 £ 

(m > 1) ER -4 £ 

(m > 1) IC -4 8 

(m > 1) ABLE -4 £ 

(m > 1) ANT -4 £ 

(m > 1) EMENT -4 £ 

(m > 1) MENT -4 £ 

(m > 1) ENT -4 £ 

(m > 1) (*S or *T) & ION -4 £ 

(m > 1) OU -4 8 

(m > 1) ISM -4 £ 

(m > 1) ATE -4 £ 

(m > 1) ITI -4 £ 

(m > 1) OUS -4 £ 

(m > 1) IYE -4 £ 

(iti 1) IZE —^ £ 


revival 


reviv 

allowance 


allow 

inference 


infer 

airliner 


airlin 

gyroscopic 


gyroscop 

defensible 

-> 

defens 

irritant 

-y 

irrit 

replacement 


replac 

adjustment 

-> 

adjust 

dependent 


depend 

adoption 


adopt 

homologou 


homolog 

communism 


commun 

activate 


activ 

angulariti 


angular 

homologous 


homolog 

effective 

-> 

effect 

bowdlerize 


bowdler 


Step 7a: Cleanup 


(m > 1) E —>■ e 

probate -4 probat 
rate -4 rate 

(m = 1 & ! *o) E -4 £ 

cease -4 ceas 


Step 7b: Cleanup 


(m > 1 & *d *L) 


-4 [single letter] 


controll -4 control 
roll -4 roll 




C5 AND Cl TAGSETS 


Tag 

Description 

Example 

AJO 

adjective (unmarked) 

good , old 

AJC 

comparative adjective 

better , older 

AJS 

superlative adjective 

best, oldest 

ATO 

article 

the , a, an 

AVO 

adverb (unmarked) 

often, well y longer, furthest 

AVP 

adverb particle 

up, off, out 

AYQ 

wh-adverb 

when, how, why 

CJC 

coordinating conjunction 

and, or 

CJS 

subordinating conjunction 

although, when 

CJT 

the conjunction that 


CRD 

cardinal numeral (except one) 

3 , twenty-five , 734 

DPS 

possessive determiner 

your, their 

DTO 

general determiner 

these, some 

DTQ 

wh-determiner 

whose, which 

EXO 

existential there 


ITJ 

interjection or other isolate 

oh, yes, mhm 

NNO 

noun (neutral for number) 

aircraft, data 

NN1 

singular noun 

pencil, goose 

NN2 

plural noun 

pencils, geese 

NPO 

proper noun 

London, Michael, Mars 

ORD 

ordinal 

sixth, 77th, last 

PNI 

indefinite pronoun 

none, everything 

PNP 

personal pronoun 

you, them, ours 

PNQ 

wh-pronoun 

who, whoever 

Figure C. l First half of UCREL C5 Tagset for the British National Corpus 

(BNC) after Garside et al (1997). 
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Tag 

Description 

Example 

PNX 

reflexive pronoun 

itself ourselves 

POS 

possessive \s or ’ 


PRF 

the preposition of 


PRP 

preposition (except of) 

for, above, to 

PUL 

punctuation - left bracket 

(or [ 

PUN 

punctuation - general mark 

f * * - 9 
. . , . , .... 

PUQ 

punctuation - quotation mark 

t 9 

PUR 

punctuation - right bracket 

) or ] 

TOO 

infinitive marker to 


UNC 

unclassified items (not English) 


YBB 

base forms of be (except infinitive) 

am, are 

VBD 

past form of be 

was, were 

VBG 

-ing form of be 

being 

VBI 

infinitive of be 


VBN 

past participle of be 

been 

VBZ 

-s form of be 

is, ’s 

YDB 

base form of do(ex cept infinitive) 

does 

YDD 

past form of do 

did 

YDG 

-ing form of do 

doing 

VDI 

infinitive of do 

to do 

VDN 

past participle of do 

done 

VDZ 

-s form of do 

does 

VHB 

base form of have (except infinitive) 

have 

VHD 

past tense form of have 

had, } d 

VHG 

-ing form of have 

having 

VHI 

infinitive of have 


VHN 

past participle of have 

had 

VHZ 

-s form of have 

has, ’s 

VMO 

modal auxiliary verb 

can, could, will, 11 

VVB 

base form of lexical verb (except infin.) 

take, live 

VVD 

past tense form of lexical verb 

took, lived 

VVG 

-ing form of lexical verb 

taking, living 

VVI 

infinitive of lexical verb 

take, live 

VVN 

past participle form of lex. verb 

taken, lived 

vvz 

-s form of lexical verb 

takes, lives 

xxo 

the negative not or n Y 


zzo 

alphabetical symbol 

A, B, c, d 

Figure C.2 The rest of UCREL’s C5 Tagset (Garside et al, 1997). 
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Tag 

Description 

Example 

! 

punctuation tag - exclamation mark 



punctuation tag - quotation marks 


( 

punctuation tag - left bracket 


) 

punctuation tag - right bracket 
punctuation tag - comma 


- 

punctuation tag - dash 



new sentence marker 
punctuation tag - full-stop 
punctuation tag - ellipsis 
punctuation tag - colon 
punctuation tag - semi-colon 


7 

punctuation tag - question-mark 


APPGE 

possessive pronoun, prenominal 

my, your, our etc. 

AT 

article 

the, no 

ATI 

singular article 

a, an, every 

BCL 

before-clause marker 

in order [that] 

cc 

coordinating conjunction 

and, or 

CCB 

coordinating conjunction 

but 

CS 

subordinating conjunction 

if,', because, unless 

CSA 

as as a conjunction 


CSN 

than as a conjunction 


CST 

that as a conjunction 


csw 

whether as a conjunction 


DA 

post-determiner/pronoun 

such, former, same 

DAI 

singular after-determiner 

little, much 

DA2 

plural after-determiner 

few, several, many 

DAR 

comparative after-determiner 

more, less 

DAT 

superlative after-determiner 

most, least 

DB 

pre-determiner/pronoun 

all, half 

DB2 

plural pre-determiner/pronoun 

both 

DD 

determiner/pronoun 

any, some 

DD1 

singular determiner 

this, that, another 

DD2 

plural determiner 

these, those 

DDQ 

wh-determiner 

which, what 

DDQGE 

wh-determiner, genitive 

whose 

DDQY 

wh-ever determiner 

whichever, whatever 

EX 

existential there 


FO 

formula 


FU 

unclassified 


Figure C.3 First part of UCREL C7 Tagset for the British National Corpus 

(BNC) from (Garside et al, 1997). 
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Tag 

Description 

Example 

FW 

foreign word 


GE 

germanic genitive marker - 

’ or ’s 

IF 

for as a preposition 


II 

preposition 

in , on, to 

10 

of as a preposition 


IW 

with; without as preposition 


JJ 

general adjective 

big, old 

JJR 

general comparative adjective 

older, better, bigger 

JJT 

general superlative adjective 

oldest, best, biggest 

JK 

adjective catenative 

able in be able to 
willing in be willing to 

MC 

cardinal number (neutral for number) 

two, three ... 

MCI 

singular cardinal number 

one 

MC2 

plural cardinal number 

tens, twenties 

MCMC 

hyphenated number 

40-50 , 1770-1827 

MD 

ordinal number 

first, 2nd, next, last 

ND1 

singular noun of direction 

north, southeast 

NN 

common noun (neutral for number) 

sheep, cod 

NN1 

singular common noun 

book, girl 

NN2 

plural common noun 

books, girls 

NNA 

following noun of title 

M.A. 

NNB 

preceding noun of title 

Mr, Prof 

NNL1 

singular locative noun 

street. Bay 

NNL2 

plural locative noun 

islands, roads 

NNO 

numeral noun (neutral for number) 

dozen, thousand 

NN02 

plural numeral noun 

hundreds, thousands 

NNT 

temporal noun (neutral for number) 

no known examples 

NNT1 

singular temporal noun 

day, week, year 

NNT2 

plural temporal noun 

days, weeks, years 

NNU 

unit of measurement 
(neutral for number) 

in., cc. 

NNU1 

singular unit of measurement 

inch, centimetre 

NNU2 

plural unit of measurement 

inches, centimetres 

NP 

proper noun (neutral for number) 

Phillipines, Mercedes 

NP1 

singular proper noun 

London, Jane, Frederick 

NP2 

plural proper noun 

Browns, Reagans, Koreas 

NPD1 

singular weekday noun 

Sunday 

NPD2 

plural weekday noun 

Sundays 

Figure C.4 More of UCREL’s C7 Tagset (Garside et al, 1997). 
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Tag 

Description 

Example 

NPM1 

singular month noun 

October 

NPM2 

plural month noun 

Octobers 

PN 

indefinite pronoun (neutral for number) 

none 

PN1 

singular indefinite pronoun 

one , everything, nobody 

PNQO 


whom 

PNQS 


who 

PNQY 


whoever , whomever 
whomsoever , whosoever 

PNX1 

reflexive indefinite pronoun 

oneself 

PPGE 

nominal possessive personal pronoun 

mine, yours 

PPH1 


it 

PPHOl 


him, her 

PPH02 


them 

PPHS1 


She, she 

PPHS2 


they 

PPIOl 


me 

PPI02 


us 

PPIS1 


I 

PPIS2 


we 

PPX1 

singular reflexive personal pronoun 

yourself itself 

PPX2 

plural reflexive personal pronoun 

yourselves, ourselves 

PPY 


you 

RA 

adverb, after nominal head 

else, galore 

REX 

adverb introducing 
appositional constructions 

namely, viz, eg. 

RG 

degree adverb 

very , so , too 

RGQ 

wh- degree adverb 

how 

RGQY 

wh-ever degree adverb 

however 

RGR 

comparative degree adverb 

more, less 

RGT 

superlative degree adverb 

most , least 

RL 

locative adverb 

alongside, forward 

RP 

prepositional adverb; particle 

in, up, about 

RPK 

prepositional adverb, catenative 

about in be about to 

RR 

general adverb 

actually 

RRQ 

wh- general adverb 

where, when, why, how 

RRQY 

wh-ever general adverb 

wherever , whenever 

RRR 

comparative general adverb 

better , longer 

RRT 

superlative general adverb 

best, longest 

RT 

nominal adverb of time 

now, tommorow 

Figure C.5 More of UCREL’s C7 Tagset (Garside et al. , 1997). 
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Tag 

Description 

Example 

TO 

infinitive marker 

to 

UH 

interjection 

oh, yes, urn 

VBO 


be 

YBDR 


were 

YBDZ 


was 

YBG 


being 

YBI 

infinitive be 


YBM 


am 

YBN 


been 

YBR 


are 

YBZ 


is 

YDO 


do 

YDD 


did 

YDG 


doing 

YDI 

infinitive do 


YDN 


done 

YDZ 


does 

YHO 


have 

YHD 

past tense had 


YHG 


having 

VHI 

infinitive have 


YHN 

past participle had 


YHZ 


has 

YM 

modal auxiliary 

can, will, would t tc. 

YMK 

modal catenative 

ought, used 

YYO 

base form of lexical verb 

give, work etc. 

YYD 

past tense form of lexical verb 

gave, worked etc. 

YYG 

-ing form of lexical verb 

giving, working etc. 

YYGK 

-ing form in a catenative verb 

going in be going to 

YYI 

infinitive of lexical verb 

[to] give, [to] work etc. 

YYN 

past participle form of lexical verb 

given, worked etc. 

YYNK 

past part, in a catenative verb 

bound in be bound to 

YYZ 

-s form of lexical verb 

gives, works etc. 

XX 


not, n’t 

ZZ1 

singular letter of the alphabet 

A, a, B, etc. 

ZZ2 

plural letter of the alphabet 

As, b’s, etc. 

Figure C.6 The rest of UCREL’s C7 Tagset (Garside el al., 1997) 




TRAINING HMMS: THE 
FORWARD-BACKWARD 
ALGORITHM 


This appendix sketches the forward-backward or Baum-Welch algorithm 
(Baum, 1972), a special case of the Expectation-Maximization or EM al¬ 
gorithm (Dempster etal, 1977). The algorithm will let us train the transition 
probabilities and the emission probabilities b;(o t ) of the HMM. While it 
is theoretically possible to train both the network structure of an HMM and 
these probabilities, no good algorithm for this double-induction exists. Thus 
in practice the structure of most HMMs is designed by hand, and then the 
transition and emission probabilities are trained from a large set of observa¬ 
tion sequences O. Furthermore, it turns out that the problem of setting the 
a and b parameters so as to exactly maximize the probability of the obser¬ 
vation sequence O is unsolved. The algorithm that we give in this section is 
only guaranteed to find a local maximum. The forward-backward algorithm 
is used throughout speech and language processing, for example in training 
HMM-based part-of-speech taggers, as we saw in Chapter 8. Extensions of 
forward-backward are also important, like the Inside-Outside algorithm used 
to train stochastic context-free-grammars (Chapter 12). 

Let us begin by imagining that we were training not a Hidden Markov 
Model but a vanilla Markov Model. We do this by running the model on the 
observation and seeing which transitions and observations were used. For 
ease of description in the rest of this section, we will pretend that we are 
training on a single sequence of training data (called 0), but of course in 
a real speech recognition system we would train on hundreds of thousands 
of sequences (thousands of sentences). Since unlike an HMM, a vanilla 
Markov Model is not hidden, we can look at an observation sequence and 
know exactly which transitions we took through the model, and which state 
generated each observation symbol. Since every state can only generate one 
observation symbol, the observation b probabilities are all 1.0. The proba¬ 
bility aij of a particular transition between states i and j can be computed by 
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FORWARD 

PROBABILITY 

BACKWARD 

PROBABILITY 


counting the number of times the transition was taken, which we could call 
C(i —>/), and then normalizing by the total count of all times we took any 
transition from state L 


ck j — 


C(i —> j ) 


(D.l) 


For an HMM we cannot compute these counts directly from an ob¬ 
served sentence (or set of sentences), since we don’t know which path of 
states was taken through the machine for a given input. The Baum-Welch 
uses two neat intuitions to solve this problem. The first idea is to iteratively 
estimate the counts. We will start with an estimate for the transition and ob¬ 
servation probabilities, and then use these estimated probabilities to derive 
better and better probabilities. The second idea is that we get our estimated 
probabilities by computing the forward probability for an observation and 
then dividing that probability mass among all the different paths that con¬ 
tributed to this forward probability. 

In order to understand the algorithm, we need to return to the forward 
algorithm of Chapter 5 and more formally define two related probabilities 
which will be used in computing the final probability: the forward proba¬ 
bility and the backward probability. We refer to the forward probability 
as a and the backward probability as (3. Recall that we defined the forward 
probability as the probability of being in state i after seeing the first t obser¬ 
vations, given the automaton X: 


at{i)=P(pi,o 2 ...o t ,q t = i\X) 


(D.2) 


In Chapter 5 we used a matrix to calculate the forward probability re¬ 
cursively; now we will formally define the actual recursion. 


1. Initialization: 


otft(l) = aijbj(oi) l < j < N 
2. Recursion (since states 1 and N are non-emitting): 

f/v-i 


(D.3) 


aj(t) = 

3. Termination: 


E af(f-i)< 


i=2 


N -1 


bj{o t ) 1 < j < N, 1 < t < T (D.4) 


(D.5) 


P(0\X) = a N (T)=Y, 0 ii(T)a iN 

i =2 

As we saw in Chapter 5, the forward probability is computed via a 
matrix or lattice, in which each column is computed by extending the paths 



from the previous columns. Figure D. 1 illustrates the induction step for com¬ 
puting the value in one new cell. 



OC.(t) OCj(t+l) 


Figure D.l The computation of a i(t) by summing all the previous values 
otf-i weighted by their transition probabilities a and multiplying by the obser¬ 
vation probability bi{o t + 1 ). Of course in any given HMM many or most of the 
transition probabilities will be 0, so not all previous states will contribute to 
the forward probability of the current state. 


The second important piece of the forward-backward algorithm, the 
backward probability, is almost the mirror image of the forward probability; 
it computes the probability of seeing the observations from time t+ 1 to the 
end, given that we are in state j at time t (and of course given the automaton 

A.): 

P/(o») =P(o l+ i,o t+2 . ..o T \q, = j,X) (D. 6 ) 

It is computed inductively in a similar manner to the forward algorithm. 

1. Initialization: 


P*(0 = a iN , 1 < i < N (D.7) 

2. Recursion (again since states 1 and N are non-emitting): 

N -1 

Pi(0 = L aiMo,+i)h(t + 1) 1 < i < N,T > t > 1 (D.8) 

i =2 

3. Termination: 

P(0\X) = a N (T) = Mr) = ^ 


(D.9) 
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Figure D.2 illustrates the backward induction step. 

We are now ready to understand how the forward and backward prob¬ 
abilities can help us compute the transition probability atj and observation 
probability bi{o t ) from an observation sequence, even though the actual path 
taken through the machine is hidden! 

Let’s begin by showing how to reestimate aij. We will proceed to esti¬ 
mate djj by a variant of (D. 1): 

„ expected number of transitions from state i to state j 
lJ expected number of transitions from state i 

How do we compute the numerator? Here’s the intuition. Assume 
we had some estimate of the probability that a given transition i —> j was 
taken at a particular point in time t in the observation sequence. If we knew 
this probability for each particular time t, we could sum over all times t to 
estimate the total count for the transition i —> j . 

More formally, let’s define the probability % t (x for transition) as the 
probability of being in state i at time t and state j at time t+ 1, given the 
observation sequence and of course the model: 

z t (h j) = P{% = i,q t +1 =j\0,X) (D.ll) 

In order to compute % u we first compute a probability which is similar 
to T,, but differs in including the probability of the observation: 

not-quite-T f (i, j) = P(q t = i, q, +l = j, 0\X) (D. 12) 

Figure D.3 shows the various probabilities that go into computing not- 
quite-X/: the transition probability for the arc in question, the a probability 



before the arc, the P probability after the arc, and the observation probability 
for the symbol just after the arc. 
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Or a.(t) 

°t_i " O t 

a u* b j(° t+ i) 

l 

\ 

\ 

\ 

\ 

\ 

a 

pj(t+i) 

0 t+l 0 1+2 

Figure D.3 Computation of the joint probability of being in state i at time 
t and state j at time t + 1. The figure shows the various probabilities that need 
to be combined to produce P{q t = i,q t +\ — h 0\X): the a and P probabilities, 
the transition probability and the observation probability bj{o t + 1 ). After 

Rabiner (1989). 


These are multiplied together to produce not-quite-H t as follows 

not-quite-x, (i, j) = a/(f)flyfy(0,+i)|5/(f +1) (D.13) 

In order to compute x t from not-quite-x t , the laws of probability in¬ 
struct us to divide by P(0 |X), since: 




(D.14) 


P(0\k 

The probability of the observation given the model is simply the for¬ 
ward probability of the whole utterance, (or alternatively the backward prob¬ 
ability of the whole utterance!), which can thus be computed in a number of 
ways: 

P(0\X) = a n (T) = [3 l (T) = £ (D.15) 

;=i 

So, the final equation for % t is: 

T,(.,/I = a (D. 16 ) 

The expected number of transitions from state i to state j is then the 
sum over all t of x. For our estimate of atj in (D.10), we just need one more 
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thing: the total expected number of transitions from state i. We can get this 
by summing over all transitions out of state L Here’s the final formula for 

&i j : 


j — 


I f =l ^(U) 


(D.17) 


We also need a formula for recomputing the observation probability. 
This is the probability of a given symbol vy from the observation vocabulary 
V, given a state j : bj{vk). We will do this by trying to compute: 


bj{vk) 


expected number of times in state j and observing symbol vy 
expected number of times in state j 


(D. 


For this we will need to know the probability of being in state j at time 
t , which we will call <5j{t) (a for state): 


Cj(t)=P(q t =j\0,\) 


(D.19) 


Once again, we will compute this by including the observation se¬ 
quence in the probability: 


Cj{t) 


P(q t = j,Q\\) 
P(0\X) 


(D.20) 



As Figure D.4 shows, the numerator of (D.20) is just the product of the 
forward probability and the backward probability: 


Gj{t) 


W) 


(D.21) 
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We are ready to compute b. For the numerator, we sum Cj(t) for all 
time steps t in which the observation o t is the symbol Vk that we are interested 
in. For the denominator, we sum Gj(t) over all time steps t. The result will 
be the percentage of the times that we were in state j that we saw symbol v* 
(the notation Tj=\ s .t.o t =v k means ’’sum over all t for which the observation at 
time t was v*): 


bj{vk) 


^t=\s.t.O t =Vj i 

L,=i°yV) 


(D.22) 


We now have ways to re-estimate the transition a and observation b 
probabilities from an observation sequence O assuming that we already have 
a previous estimate of a and b. The entire training procedure for HMMs, 
called embedded training, first chooses some estimate for a and b , and then 
uses equations (D.22) and (D.17) to re-estimate a and fo, and the repeats until 
convergence. In the next sections we will see how forward-backward is ex¬ 
tended to inputs which are non-discrete (‘continuous observation densities’) 
via Gaussian functions. Section 7.7 discussed how the embedded training 
algorithm gets its initial estimates for a and b. 


Continuous Probability Densities 

The version of the parameter reestimation that we have described so far sec¬ 
tion assumes that the input observations were discrete symbols from some 
reasonably-sized alphabet. This is naturally true for some uses of HMMs; 
for example Chapter 8 will introduce the use of HMMs for part-of-speech- 
tagging. Here the observations are words of English, which is a reasonably- 
sized finite set, say approximately 100K words. For speech recognition, the 
LPC cepstral features that we introduced constitute a much larger alphabet 
(11 features, each one say a 32-bit floating-point number), for a total vo¬ 
cabulary size of 2( n * 32 ). In fact, since in practice, we usually use not 11 
features, but delta-features and double-delta features as well, the vocabulary 
size would be enormous. Chapter 7 mentioned that one way to solve this 
problem is to cluster or vector quantize the cepstral features into a much 
smaller set of discrete observation symbols. A more effective approach is to 
use either mixtures of Gaussian estimators neural networks (multi-layer 
perceptrons) to estimate a probability density function or pdf over a con¬ 
tinuous space, as we suggested in Chapter 7. 

HMMs with Gaussian observation-probability-estimators are trained 
by a simple extension to the forward-backward algorithm. Recall from Chap¬ 
ter 7 that in the simplest use of Gaussians, we assume that the possible values 
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of the observation feature vector o t are normally distributed, and so we rep¬ 
resent the observation probability function bj(o t ) as a Gaussian curve with 
mean vector jjj and covariance matrix (prime denotes vector transpose): 

b,to,) = 1 ( D .23) 

' >/W 71 

Usually we make the simplifying assumption that the covariance ma¬ 
trix Y,j is diagonal, which means that in practice we are keeping a single 
separate mean and variance for each feature in the feature vector. 

How are the mean and covariance of the Gaussians estimated? It is 
helpful again to consider the simpler case of a non-hidden Markov Model, 
with only one state /. The vector of feature means ja and the vector of covari¬ 
ances E could then be estimated by averaging: 


Pi = (D.24) 

£/ = (D-25) 

1 t = 1 

But since there are multiple hidden states, we don’t know which ob¬ 
servation vector o t was produced by which state. What we would like to 
do is assign each observation vector o t to every possible state /, prorated by 
the probability that the HMM was in state i at time t. Luckily, we already 
know how to do this prorating; the probability of being in state i at time t is 
Of(f), which we saw how to compute above! Of course we’ll need to do the 
probability computation of iteratively since getting a better observation 
probability b will also help us be more sure of the probability o of being in 
a state at a certain time. So the actual re-estimation equations are: 


Pi 

% 


zLi Qi(t)o t 

l/=i -Pi)'(o, -m) 

L = i <*(0 


(D.26) 

(D.27) 


The sums in the denominators are for the same normalization that we 
saw in (D.22). Equations (D.27) and (D.27) are then used in the forward- 
backward (Baum-Welch) training of the HMM. The values of jai and a z * are 
first set to some initial estimate, which is then re-estimated until the numbers 
converge. 
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See Jelinek (1997) or Rabiner and Juang (1993) for a more complete 
description of the forward-backward algorithm. Jelinek (1997) also shows 
the relationship between forward-backward and EM. 
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